Title : Decoding Creditworthiness: Home Credit Default Risk Prediction¶

In the context of the Home Credit Default Risk (HCDR) Kaggle Competition, the objective is to develop a robust predictive model to determine whether a client will successfully repay a loan. Home Credit, a leading financial institution, aims to ensure a positive loan experience for individuals who face challenges in securing loans due to limited or nonexistent credit histories. To achieve this goal, Home Credit leverages a diverse set of alternative data sources, including telco and transactional information, to assess their clients' repayment capabilities.

Challenges¶

  1. Data Quality and Preprocessing
  2. Feature Engineering
  3. Handling Imbalanced Data
  4. Model Selection and Hyperparameter Tuning
  5. Interpretable and Explainable Models
  6. Computational Resources
  7. Overfitting
  8. Time Constraints

Phase - 1 Leader Plan¶

Phase-1 Leader - Naveen Vardhineni¶

Phase Contributor Contribution Details
Phase 1 Naveen Phase Leader
Phase 1 Naveen Data files overview
Phase 1 Anurag Describing the data
Phase 1 Bharath Planning credit assignment
Phase 1 Alexis Git Repo Creation
Phase 1 Anurag Metrics description
Phase 1 Bharath ML Algorithms to be used
Phase 1 Naveen Pipeline description
Phase 1 Alexis Block model of Pipeline
Phase 1 Bharath Gantt Chart preparation
Phase 1 Naveen Submission of Phase 1

Credit Assignment Plan for Phase 1¶

Specific Measurable Achievable Relevant Time-bound Responsible
Provide concise descriptions of key data files in HCDR Include essential details for each file, ensuring clarity in understanding Summarize the essential information for each file accurately Offer relevant details pertinent to data analysis 5 Naveen
Provide a comprehensive and clear description of the dataset, outlining its key features, variables, and structure Cover all relevant aspects of the dataset, ensuring no crucial information is omitted Deliver a detailed yet concise overview Focus on describing data elements essential for analysis 5 Anurag
Clearly define the metrics used for evaluating model performance, specifying each metric's purpose and calculation method Include all relevant evaluation metrics Provide a detailed description of each metric without overwhelming the reader, balancing depth with clarity Focus on metrics directly impacting the project's goals, emphasizing their significance in assessing model accuracy and effectiveness 5 Anurag
Specify the machine learning algorithms to be utilized, including their names and brief descriptions of their functionalities List all selected algorithms, ensuring a comprehensive overview of the diverse techniques chosen for the project Include algorithms feasible for the project scope and dataset, ensuring practicality and relevance Focus on algorithms tailored to address the project's goals, emphasizing their suitability for the specific prediction task 10 Bharath
Outline the steps of the machine learning pipeline, detailing data preprocessing, feature engineering, model selection, and evaluation methods Clearly define each stage of the pipeline, ensuring a complete and coherent overview of the entire process Provide a comprehensive yet concise description, offering a clear understanding of the workflow without unnecessary complexity Focus on the pipeline elements crucial for model development, emphasizing their direct impact on achieving project objectives 5 Naveen
Develop a clear and concise block model and git repo creation Include labeled blocks representing each stage, ensuring a visually comprehensive overview of the pipeline's structure Design an intuitive and easy-to-understand block model, focusing on simplicity and coherence for effective communication of the pipeline workflow Highlight critical stages in the pipeline, ensuring the visual representation aligns with the project's specific objectives and analysis requirements 10 Alexis

Phase - 2 Leader Plan¶

Phase-2 Leader: Bharath Sri Vardhan Veldi¶

Phase Contributor Contribution Details
2 Bharath Phase Leader
2 Alexis, Bharath Exploratory Data Analysis
2 Anurag Pipeline Coding
2 Naveen Running Experimental Pipelines
2 Naveen Planning Credit Assignment
2 Naveen Create Presentation
2 Alexis Making Comparison
2 Anurag Making notes of results from Experiments
2 Bharath Decide Slides for presentation
2 Naveen Gantt chart Preparation
2 Bharath Submission for phase 2

Credit Assignment Plan For Phase 2¶

Specific Task Measurable Outcome Achievable Goals Relevant Information Time-bound Responsible
Exploratory Data Analysis Thorough exploration of dataset with key insights Identify patterns, outliers, and trends in the data Essential for understanding data characteristics 7.5 Alexis, Bharath
Pipeline Coding Successful implementation of the data pipeline Code functionality and structure Foundation for automated data processing 7.5 Anurag
Running Experimental Pipelines Executed pipelines with reproducible results Validate pipeline functionality with sample data Essential for testing and refining the pipeline 4 Naveen
Planning Credit Assignment Detailed plan for assigning credit in the project Develop a clear plan for credit assignment Ensure fair acknowledgment of team contributions 4 Naveen
Create Presentation Completed presentation slides for phase 2 Include key findings, visuals, and insights Communicate results effectively to stakeholders 4 Naveen
Making Comparison Comparative analysis of results from experiments Identify differences and similarities Essential for drawing conclusions and insights 7.5 Alexis
Making Notes of Results from Experiments Comprehensive documentation of experimental outcomes Summarize key findings and observations Essential for future reference and reporting 7.5 Anurag
Decide Slides for Presentation Finalized selection of slides for the presentation Review and choose the most relevant slides Ensure a coherent and impactful presentation 5 Bharath
Gantt Chart Preparation Completed Gantt chart outlining project timelines Identify key milestones and project duration Essential for project management and tracking 4 Naveen
Submission for Phase 2 Submission of all required deliverables for phase 2 Compile and organize all necessary documents Ensure timely completion and project progress 5 Bharath

Phase - 3 Leader Plan¶

Phase-3 Leader: Anurag Nampally¶

Phase Contributor Contribution Details
3 Anurag Phase Leader
3 Naveen Planning Credit Assignment
3 Bharath Polynomial Feature Expansion
3 Alexis Incorporating Domain specific features
3 Naveen Exploratory Modelling of the Data
3 Anurag Model Training
3 Anurag Baseline Modelling with Imbalanced Dataset +Advanced Features
3 Bharath Implementing oversampling with smote
3 Alexis ML models with domain feature inclusion
3 Bharath Hyperparameter tuning
3 Naveen Model Performance Comparision
3 Alexis Recording Results
3 Alexis Syncing the notebook
3 Anurag Presentation creation
3 Bharath Video preparation
3 Naveen Gantt Chat Preparation
3 Anurag Submission of Phase 3

Credit Assignment Plan For Phase 3¶

Specific Task Measurable Outcome Achievable Goals Relevant Information Time-bound Responsible
Phase Leadership Successful coordination and guidance during Phase 2 Provide clear direction and support for team members Ensure effective teamwork and progress 7.5 Anurag
Planning Credit Assignment Detailed plan for assigning credit in the project Develop a clear plan for credit assignment Ensure fair acknowledgment of team contributions 4 Naveen
Polynomial Feature Expansion Implementation of polynomial features in the model Integrate polynomial features for improved modeling Enhance model complexity and predictive power 7.5 Bharath
Incorporating Domain Specific Features Integration of domain-specific features in models Enhance model relevance to the project domain Improve model performance with domain knowledge 7.5 Alexis
Exploratory Modeling of the Data Thorough exploration and initial modeling of data Identify potential patterns and insights Lay the groundwork for subsequent model training 4 Naveen
Model Training Successful training of machine learning models Train models with selected data and features Prepare models for evaluation and testing 7.5 Anurag
Baseline Modeling with Imbalanced Dataset Development of baseline models with imbalanced data Address challenges posed by imbalanced dataset Establish a baseline for comparison and improvement 7.5 Anurag
Implementing Oversampling with SMOTE Integration of SMOTE for oversampling in the models Address class imbalance through oversampling Improve model performance on minority class 7.5 Bharath
ML Models with Domain Feature Inclusion Creation of models incorporating domain features Evaluate models with domain-specific information Enhance model accuracy and relevance 7.5 Alexis
Hyperparameter Tuning Optimization of model hyperparameters Fine-tune models for improved performance Enhance model efficiency and generalization 7.5 Bharath
Model Performance Comparison Comparative analysis of model performance Evaluate and compare models based on metrics Identify the most effective model configurations 7.5 Naveen
Recording Results Comprehensive documentation of experimental outcomes Summarize key findings and observations Essential for future reference and reporting 7.5 Alexis
Syncing the Notebook Synchronization of project notebooks and files Ensure consistency and version control Facilitate collaboration and troubleshooting 4 Alexis
Presentation Creation Development of presentation slides for Phase 2 Communicate key findings and insights effectively Ensure a clear and engaging presentation 7.5 Anurag
Video Preparation Creation of video content for project presentation Compile visuals and narration for the video Enhance communication and project understanding 7.5 Bharath
Gantt Chart Preparation Completed Gantt chart outlining project timelines Identify key milestones and project duration Essential for project management and tracking 4 Naveen
Submission of Phase 3 Submission of all required deliverables for Phase 3 Compile and organize all necessary documents Ensure timely completion and project progress 5 Anurag

Phase - 4 Leader Plan¶

Phase-4 Leader: Alex Perez¶

Phase Contributor Contribution Details
4 Alexis Phase Leader
4 Naveen Planning Credit Assignment
4 Alexis Data preperation for DeepLearning
4 Anurag Single neural network
4 Bharath Deep Neural Network
4 Naveen Define a Loss Function
4 Bharath Building the Model and Training the model
4 Bharath Video presentation Planning
4 Alexis Creating final Repo on Github
4 Anurag Working on Final NoteBook
4 Anurag Presentation Creation
4 Alexis Video Presentation
4 Naveen Gantt Chart Preparation
4 Bharath Submission of Phase 4

Credit Assignment Plan For Phase 4¶

Specific Task Measurable Outcome Achievable Goals Relevant Information Time-bound Responsible
Phase Leadership Successful coordination and guidance during Phase 4 Provide clear direction and support for team members Ensure effective teamwork and progress 7.5 Alexis
Planning Credit Assignment Detailed plan for assigning credit in the project Develop a clear plan for credit assignment Ensure fair acknowledgment of team contributions 4 Naveen
Data Preparation for Deep Learning Well-prepared data for deep learning model Clean, preprocess, and organize data for modeling Facilitate effective training of deep learning models 7.5 Alexis
Single Neural Network Implementation and training of a single neural network Develop and train a neural network model Establish a baseline for more complex models 7.5 Anurag
Deep Neural Network Design and training of a deep neural network model Develop a deep learning model for improved performance Enhance model complexity and predictive power 7.5 Bharath
Define a Loss Function Clear definition of a loss function for model optimization Establish criteria for model training and evaluation Enhance model training efficiency 4 Naveen
Building the Model and Training the Model Successful construction and training of the model Implement the designed model and train it Prepare models for evaluation and testing 7.5 Bharath
Video Presentation Planning Detailed plan for creating a video presentation Outline content, visuals, and narration for the video Ensure a clear and engaging video presentation 7.5 Bharath
Creating Final Repository on GitHub Establishment of the final project repository on GitHub Create a well-organized and documented repository Facilitate collaboration and version control 7.5 Alexis
Working on Final Notebook Compilation and documentation of final project results Summarize key findings and observations Essential for future reference and reporting 7.5 Anurag
Presentation Creation Development of presentation slides for Phase 4 Communicate key findings and insights effectively Ensure a clear and engaging presentation 7.5 Anurag
Video Presentation Creation of video content for Phase 4 presentation Compile visuals and narration for the video Enhance communication and project understanding 7.5 Alexis
Gantt Chart Preparation Completed Gantt chart outlining Phase 4 timelines Identify key milestones and project duration Essential for project management and tracking 4 Naveen
Submission of Phase 4 Submission of all required deliverables for Phase 4 Compile and organize all necessary documents Ensure timely completion and project progress 5 Bharath

Feel free to adapt the details based on your project's specific needs!

Abstract¶

Situation: Home Credit, a leading financial institution, aims to provide improved credit decisions for individuals with limited credit histories. Traditional credit scoring methods often fail to adequately measure a person’s appropriateness for credit, affecting financial inclusion.

Task: Our goal is to develop an accurate predictive model that leverages telcos and other data sources. By participating in the Home Loan Predetermined Risk Project, we strive to create innovative solutions to bridge the gap in credit assessment, and ensure fair and accessible lending to a wider audience.

Action: Using advanced machine learning algorithms and data analytics techniques, we analyze a variety of data types, including applicants’ financial and personal information. Through rigorous feature engineering, model selection, and validation, we build accurate predictive models that distinguish customers’ ability to pay.

Results: The program provides a robust predictive model, empowering home lenders to make informed lending decisions. This model provides a nuanced understanding of applicant credit, promotes financial inclusion and reinforces Home Credit’s mission to provide a positive lending experience for all customers.

Timeframe: Within the scheduled timeline, we actively analyze data, replicate models and deliver credible solutions. Our approach aligns with the competition’s objectives, ensuring timely delivery of outstanding results.

Dataset¶

The dataset for the Home Credit Default Risk project encompasses a rich and diverse collection of financial and personal information about loan applicants. Comprising multiple CSV files, it provides a comprehensive view of borrowers' credit histories and behaviors. The primary application_train.csv and application_test.csv files offer vital insights into applicants' demographic details, such as age, income, education, and family status.

Supplementary files like bureau.csv and previous_application.csv extend the dataset, offering historical data from credit bureaus and past loan applications, respectively. POS_CASH_balance.csv, credit_card_balance.csv, and installments_payments.csv files provide intricate details about applicants' previous loans, including payment histories and installment schedules.

In addition to these core files, telco and transactional data further enrich the dataset. The bureau_balance.csv file provides monthly updates on credits in the applicant's credit bureau accounts, adding granularity to the historical data. The dataset's complexity and depth empower data scientists to conduct in-depth analyses and construct predictive models.

This dataset is a valuable resource for machine learning practitioners, enabling the development of accurate credit risk assessment models. Its multidimensional nature allows for sophisticated feature engineering and exploration, contributing significantly to the competition's aim of enhancing lending decisions and promoting financial inclusion.

The seven different sources of data for the Home Credit Default Risk project:

application_train/applicationtest(application{train|test}.csv) (307k rows and 48k rows):

Main training and testing data containing loan application details at Home Credit. Each loan is represented by a unique row identified by the feature SK_ID_CURR. Includes the TARGET variable indicating 0 for repaid loans and 1 for loans with payment difficulties. bureau (1.7 Million rows):

Data on client's previous credits from other financial institutions(bureau.csv) Each previous credit has a row; one loan in application data can have multiple previous credits. bureau_balance (27 Million rows):

Monthly data on previous credits in bureau, with each row representing a month of a previous credit(bureau_balance.csv) Multiple rows for a single previous credit, indicating credit activity over several months. previous_application (1.6 Million rows):

Records previous loan applications at Home Credit for clients with loans in application data(POS_CASH_balance.csv) Each previous application is represented by a single row identified by SK_ID_PREV. POS_CASH_BALANCE (10 Million rows):

Monthly data on previous point-of-sale or cash loans clients had with Home Credit(credit_card_balance.csv) Each row represents one month of a previous point-of-sale or cash loan, allowing tracking of payment behavior. credit_card_balance (Millions of rows):

Monthly data on previous credit card accounts clients held with Home Credit(installments_payments.csv) Each row indicates one month of credit card balance, offering insights into credit utilization and payment patterns. installments_payment (13.6 Million rows):

Payment history for previous loans at Home Credit, capturing both made and missed payments(HomeCredit_columns_description.csv) Each payment, successful or missed, is represented by a row, providing a detailed record of borrower behavior.

These diverse data sources form the foundation for creating predictive models, allowing in-depth analysis of applicants' credit histories and behaviors. The extensive dataset enables the exploration of various features, contributing to accurate credit risk assessment and enhanced lending decisions.

image.png

Machine Learning Algorithms¶

Logistic Regression:
Description: Logistic Regression is a linear algorithm used for binary classification tasks. It models the probability that an instance belongs to a particular class.
Why ?: Suitable for its simplicity and interpretability. It serves as a baseline model and works well when the relationship between features and the target variable is approximately linear.

Random Forest:
Description: Random Forest is an ensemble method that builds multiple decision trees and merges their predictions. It handles non-linearity, captures complex relationships, and reduces overfitting.
Why ?: Suitable for capturing intricate patterns in the data. Random Forest is robust, performs well on large datasets, and handles both numerical and categorical features effectively.

Neural Networks (Deep Learning):
Description: Neural Networks consist of interconnected nodes (neurons) organized in layers. Deep Learning involves neural networks with multiple hidden layers.
Why ?: Suitable for capturing complex, non-linear relationships in the data. Deep Learning excels in tasks where features are highly abstract or hierarchical, potentially capturing nuanced patterns in credit default behavior.

Gradient Boosting Machines (GBM):
Description: GBM builds multiple decision trees sequentially, correcting errors made by previous models. It combines weak learners to create a strong predictive model.
Why ?: Suitable for improving accuracy and capturing complex patterns. GBM excels in reducing bias and variance, making it powerful for predicting credit default risk.

XGBoost (Extreme Gradient Boosting):
Description: XGBoost is an optimized implementation of gradient boosting, designed for speed and performance. It uses regularization techniques to prevent overfitting.
Why ?: Suitable for large datasets and high-dimensional feature spaces. XGBoost handles missing data efficiently and is known for its high predictive accuracy.

Bagging (Bootstrap Aggregating):
Description: Bagging is an ensemble learning method that builds multiple models on different subsets of the training data, using bootstrap sampling. It aims to reduce overfitting and improve model stability by combining diverse predictions.

Why?: Effective for high-variance models like decision trees, bagging averages or votes on multiple models, providing a more robust and generalized prediction.

Metrics¶

When evaluating the success of a machine learning model in the Home Credit Default Risk project, it's essential to consider both standard metrics commonly used in classification tasks and domain-specific metrics tailored to the specific objectives of predicting loan defaults. Here's a list of metrics that you might use to measure success:

Standard Metrics:¶

A confusion matrix is a table used in classification machine learning to evaluate the performance of a model. It presents a summary of the actual vs. predicted classifications done by a classification algorithm. The matrix has four important metrics:

True Positives (TP): The number of instances correctly predicted as positive. True Negatives (TN): The number of instances correctly predicted as negative. False Positives (FP): The number of instances incorrectly predicted as positive (actually negative). False Negatives (FN): The number of instances incorrectly predicted as negative (actually positive).

  1. Accuracy:

    • Measures the overall correctness of the model's predictions, calculated as (TP + TN) / (TP + TN + FP + FN), where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
  2. Precision:

    • Indicates the proportion of true positive predictions among all positive predictions made by the model, calculated as TP / (TP + FP).
  3. Recall (Sensitivity):

    • Measures the proportion of actual positive instances that were correctly identified by the model, calculated as TP / (TP + FN).
  4. F1-Score:

    • Harmonic mean of precision and recall, providing a balance between the two, calculated as 2 (Precision Recall) / (Precision + Recall).
  5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):

    • Represents the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate. A higher AUC value indicates a better model performance.
  6. PR-AUC (Precision-Recall Area Under Curve):

    • Represents the area under the precision-recall curve, which plots precision against recall. PR-AUC is particularly useful for imbalanced datasets.

Domain-Specific Metrics:¶

  1. Profit/Loss Metrics:

    • Measure the financial impact of the model's predictions, considering the costs associated with false positives and false negatives. For example, calculating the profit gained from correctly identified good loans and the losses incurred from defaults.
  2. Risk Metrics:

    • Evaluate the model's ability to identify high-risk loans accurately, focusing on minimizing the number of false negatives (missed defaults).
  1. Lift and Gain Charts:

    • Visualize how much better the model performs compared to random chance. Lift charts show the ratio of the model's performance to random selection, while gain charts illustrate the percentage of positive instances captured by the model compared to random selection.
  2. Bad Rate Metrics:

    • Evaluate the model's performance in identifying loans with a high probability of default, focusing on minimizing the bad rate (percentage of bad loans in the selected portfolio).
  3. Stability Metrics:

    • Measure the consistency and stability of the model's predictions over time, ensuring reliable performance across different time periods and datasets.

We may choose combination of these metrics to finally evaluate our model.

Gant Chart for Phase 1¶

image.png

Gant Chart for Phase 2¶

image-2.png

Gant Chart for Phase 3¶

image.png

Gant Chart for Phase 4¶

image-2.png

Pipeline Steps¶

image.png

Certainly! Here's a description of the pipeline steps for implementing the machine learning algorithms mentioned earlier in the context of the Home Credit Default Risk project:

1. Logistic Regression:¶

  • Data Preprocessing:
    • Handle missing values, categorical encoding, and feature scaling if necessary.
  • Feature Selection:
    • Identify relevant features using techniques like correlation analysis or feature importance.
  • Model Training:
    • Train the logistic regression model on the preprocessed dataset.
  • Evaluation:
    • Evaluate the model using appropriate metrics (accuracy, precision, recall, F1-score) and analyze the confusion matrix.
  • Optimization:
    • Fine-tune hyperparameters using techniques like grid search or randomized search for better performance.

2. Gradient Boosting Machines (GBM):¶

  • Data Preprocessing:
    • Handle missing values, encode categorical features, and scale variables.
  • Feature Engineering:
    • Create new features or transform existing ones to capture complex relationships.
  • Model Training:
    • Train a GBM classifier using the preprocessed dataset.
  • Evaluation:
    • Evaluate the model's performance metrics, focusing on accuracy, precision, recall, and F1-score.
  • Optimization:
    • Fine-tune hyperparameters (learning rate, tree depth, subsample, etc.) using techniques like grid search or random search for optimal results.

3. XGBoost (Extreme Gradient Boosting):¶

  • Data Preprocessing:
    • Handle missing values, encode categorical features, and scale variables if required.
  • Feature Engineering:
    • Create relevant features or perform transformations for improved model accuracy.
  • Model Training:
    • Train an XGBoost classifier on the preprocessed dataset.
  • Evaluation:
    • Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
  • Optimization:
    • Perform hyperparameter tuning using techniques like grid search or Bayesian optimization to enhance the model's performance.

4. Neural Networks (Deep Learning):¶

  • Data Preprocessing:
    • Handle missing values, normalize features, and encode categorical variables.
  • Feature Engineering (if applicable):
    • Create additional features or perform dimensionality reduction using techniques like PCA.
  • Neural Network Architecture:
    • Design the neural network architecture with input, hidden, and output layers, specifying the number of nodes and activation functions.
  • Model Training:
    • Train the neural network on the preprocessed dataset, specifying loss functions and optimizers.
  • Evaluation:
    • Evaluate the neural network's performance using metrics like accuracy, precision, recall, and F1-score.
  • Optimization:
    • Tune hyperparameters, adjust network architecture, or apply regularization techniques to enhance model generalization.

image-2.png

5. Bagging (Bootstrap Aggregating):¶

  • Data Preprocessing:
    • No specific preprocessing required; bagging works well with raw data.
  • Feature Engineering (if applicable):
    • Focus on creating diverse subsets through bootstrap sampling; individual models handle variations.
  • Bagging Ensemble:
    • Assemble multiple models with variations in training data; no specific layer design as in neural networks.
  • Model Training:
    • Train each model independently on different bootstrap samples; no explicit loss function as in neural networks.
  • Evaluation:
    • Combine predictions from multiple models; evaluate ensemble performance through metrics like accuracy or F1-score.
  • Optimization:
    • Tune hyperparameters related to ensemble size and sampling for improved model diversity and performance.

6. Random Forest:¶

  • Data Preprocessing:
    • Handle missing values, categorical encoding, and feature scaling if necessary.
  • Feature Selection:
    • Leverage techniques such as SelectKBest, permutation importance, or SHAP values to identify influential features.
  • Model Training:
    • Train the Random Forest model, an ensemble of decision trees, on the preprocessed dataset.
  • Evaluation:
    • Assess model performance using various metrics (accuracy, precision, recall, F1-score) and examine the confusion matrix.
  • Optimization:
    • Fine-tune hyperparameters, such as the number of trees, maximum depth, and minimum samples split, using grid search or randomized search to enhance model effectiveness.

These pipeline steps provide a structured approach to implementing the selected machine learning algorithms, ensuring proper preprocessing, feature engineering, model training, evaluation, and optimization for accurate prediction of loan default risk.

Team Description¶

Name : Anurag Nampally
Email : anampal@iu.edu

image.png

Name : Naveen Rao Vardhieni
Email : nvardhi@iu.edu

image-2.png

Name : Veldi Bharath Sri Vardhan
Email : bhaveldi@iu.edu
image-4.png

Name : Alexis Perez
Email : ap70@iu.edu

image-3.png

Kaggle API setup¶

Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, it was troublesome to submit your result as your have to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to setup, it takes me less than 15 minutes to finish a submission.

  1. Install library
  • Create a API Token (edit your profile on Kaggle.com); this produces kaggle.json file
  • Put your JSON kaggle.json in the right place
  • Access competition files; make submissions via the command (see examples below)
  • Submit result

For more detailed information on setting the Kaggle API see here and here.

In [1]:
!pip install kaggle
Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
     ---------------------------------------- 0.0/83.6 kB ? eta -:--:--
     ---- ----------------------------------- 10.2/83.6 kB ? eta -:--:--
     ------------- ------------------------ 30.7/83.6 kB 445.2 kB/s eta 0:00:01
     --------------------------- ---------- 61.4/83.6 kB 550.5 kB/s eta 0:00:01
     -------------------------------------- 83.6/83.6 kB 586.8 kB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: six>=1.10 in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (2023.11.17)
Requirement already satisfied: python-dateutil in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (2.31.0)
Requirement already satisfied: tqdm in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (4.65.0)
Requirement already satisfied: python-slugify in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (5.0.2)
Requirement already satisfied: urllib3 in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (1.26.16)
Requirement already satisfied: bleach in c:\users\tanub\anaconda3\lib\site-packages (from kaggle) (4.1.0)
Requirement already satisfied: packaging in c:\users\tanub\anaconda3\lib\site-packages (from bleach->kaggle) (23.1)
Requirement already satisfied: webencodings in c:\users\tanub\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\tanub\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tanub\anaconda3\lib\site-packages (from requests->kaggle) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\tanub\anaconda3\lib\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: colorama in c:\users\tanub\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.5.16-py3-none-any.whl size=110697 sha256=492f8775a031e452ca103a51f5a617d6ff10ba9a9f20b5345652a24f3f07933b
  Stored in directory: c:\users\tanub\appdata\local\pip\cache\wheels\6a\2b\d0\457dd27de499e9423caf738e743c4a3f82886ee6b19f89d5b7
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.5.16
In [2]:
!dir C:\Users\tanub\Downloads
 Volume in drive C is Windows-SSD
 Volume Serial Number is F43F-BE58

 Directory of C:\Users\tanub\Downloads

12/03/2023  06:54 PM    <DIR>          .
12/03/2023  06:26 PM    <DIR>          ..
12/03/2023  06:54 PM    <DIR>          .ipynb_checkpoints
12/02/2023  08:56 PM     1,095,571,496 Anaconda3-2023.09-0-Windows-x86_64.exe
12/01/2023  08:54 AM         1,375,280 ChromeSetup.exe
12/02/2023  02:56 PM        96,193,312 DiscordSetup.exe
11/30/2023  06:21 PM       616,149,608 Docker Desktop Installer.exe
12/02/2023  08:58 PM        22,916,605 FP_GroupN_HCDR_5Phase3_IPYNB.ipynb
11/30/2023  06:54 PM        60,868,040 Git-2.43.0-64-bit.exe
11/30/2023  09:05 PM           909,828 h12q.pdf
11/30/2023  07:06 PM         1,797,321 HW10_Perceptrons_Linear SVMs-Student (1).html
11/30/2023  09:02 PM         1,905,721 HW10_Perceptrons_Linear SVMs-Student.html
11/30/2023  09:02 PM         1,171,106 HW10_Perceptrons_Linear SVMs-Student.ipynb
12/01/2023  09:23 PM         2,392,761 q13.pdf
12/01/2023  02:51 PM       143,380,856 Teams_windows_x64.exe
12/01/2023  02:59 PM        94,619,344 VSCodeUserSetup-x64-1.84.2.exe
              13 File(s)  2,139,251,278 bytes
               3 Dir(s)  885,452,816,384 bytes free
In [6]:
# Copy kaggle.json to the .kaggle directory
!copy C:\Users\tanub\Downloads\kaggle.json C:\Users\tanub\.kaggle

# Remove inherited permissions and grant read permissions to the file
!icacls C:\Users\tanub\.kaggle\kaggle.json /inheritance:r
!icacls C:\Users\tanub\.kaggle\kaggle.json /grant:r "%username%:RW"
        1 file(s) copied.
processed file: C:\Users\tanub\.kaggle\kaggle.json
Successfully processed 1 files; Failed processing 0 files
processed file: C:\Users\tanub\.kaggle\kaggle.json
Successfully processed 1 files; Failed processing 0 files
In [7]:
!dir C:\Users\tanub\.kaggle
 Volume in drive C is Windows-SSD
 Volume Serial Number is F43F-BE58

 Directory of C:\Users\tanub\.kaggle

12/03/2023  06:57 PM    <DIR>          .
12/03/2023  06:57 PM    <DIR>          ..
12/03/2023  06:55 PM                68 kaggle.json
               1 File(s)             68 bytes
               2 Dir(s)  884,920,803,328 bytes free
In [8]:
! kaggle competitions files home-credit-default-risk
name                                 size  creationDate         
----------------------------------  -----  -------------------  
POS_CASH_balance.csv                375MB  2019-12-11 02:55:35  
sample_submission.csv               524KB  2019-12-11 02:55:35  
HomeCredit_columns_description.csv   37KB  2019-12-11 02:55:35  
installments_payments.csv           690MB  2019-12-11 02:55:35  
bureau_balance.csv                  358MB  2019-12-11 02:55:35  
application_test.csv                 25MB  2019-12-11 02:55:35  
bureau.csv                          162MB  2019-12-11 02:55:35  
previous_application.csv            386MB  2019-12-11 02:55:35  
application_train.csv               158MB  2019-12-11 02:55:35  
credit_card_balance.csv             405MB  2019-12-11 02:55:35  

Dataset and how to download¶

Back ground Home Credit Group¶

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group¶

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset¶

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.

Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview¶

The HomeCredit_columns_description.csv acts as a data dictioanry.

There are 7 different sources of data:

  • application_train/application_test (307k rows, and 48k rows): the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
  • bureau (1.7 Million rows): data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance (27 Million rows): monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application (1.6 Million rows): previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE (10 Million rows): monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment (13.6 Million rows): payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Table sizes¶

name                       [  rows cols]     MegaBytes         
-----------------------  ------------------  -------
application_train       : [  307,511, 122]:   158MB
application_test        : [   48,744, 121]:   25MB
bureau                  : [ 1,716,428, 17]    162MB
bureau_balance          : [ 27,299,925, 3]:   358MB
credit_card_balance     : [  3,840,312, 23]   405MB
installments_payments   : [ 13,605,401, 8]    690MB
previous_application    : [  1,670,214, 37]   386MB
POS_CASH_balance        : [ 10,001,358, 8]    375MB
In [ ]:
 

image.png

Downloading the files via Kaggle API¶

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
In [9]:
DATA_DIR = r"C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2"
!mkdir $DATA_DIR
A subdirectory or file C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2 already exists.
In [10]:
DATA_DIR
Out[10]:
'C:\\Users\\tanub\\Courses\\AML526\\I526_AML_Student\\Assignments\\Unit-Project-Home-Credit-Default-Risk\\Phase2'
In [11]:
!dir $DATA_DIR
 Volume in drive C is Windows-SSD
 Volume Serial Number is F43F-BE58

 Directory of C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2

11/30/2023  07:04 PM    <DIR>          .
11/30/2023  07:04 PM    <DIR>          ..
11/30/2023  07:04 PM         3,182,122 HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb
11/30/2023  07:04 PM            66,899 home_credit.png
11/30/2023  07:04 PM                11 Phase2.md
11/30/2023  07:04 PM         1,368,981 submission.csv
11/30/2023  07:04 PM         1,091,396 submission.png
               5 File(s)      5,709,409 bytes
               2 Dir(s)  884,921,679,872 bytes free
In [12]:
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
Downloading home-credit-default-risk.zip to C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2

  0%|          | 0.00/688M [00:00<?, ?B/s]
  0%|          | 1.00M/688M [00:00<02:13, 5.41MB/s]
  0%|          | 3.00M/688M [00:00<01:15, 9.54MB/s]
  1%|1         | 8.00M/688M [00:00<00:32, 21.9MB/s]
  2%|1         | 12.0M/688M [00:00<00:27, 26.1MB/s]
  2%|2         | 16.0M/688M [00:00<00:24, 28.7MB/s]
  3%|2         | 20.0M/688M [00:00<00:22, 30.6MB/s]
  3%|3         | 24.0M/688M [00:00<00:22, 31.5MB/s]
  4%|4         | 28.0M/688M [00:01<00:20, 33.1MB/s]
  5%|4         | 32.0M/688M [00:01<00:20, 33.5MB/s]
  5%|5         | 36.0M/688M [00:01<00:19, 34.5MB/s]
  6%|5         | 40.0M/688M [00:01<00:19, 35.1MB/s]
  6%|6         | 44.0M/688M [00:01<00:18, 35.7MB/s]
  7%|6         | 48.0M/688M [00:01<00:18, 35.8MB/s]
  8%|7         | 52.0M/688M [00:01<00:18, 36.7MB/s]
  8%|8         | 56.0M/688M [00:01<00:18, 36.8MB/s]
  9%|8         | 60.0M/688M [00:02<00:18, 35.6MB/s]
  9%|9         | 64.0M/688M [00:02<00:18, 35.2MB/s]
 10%|9         | 68.0M/688M [00:02<00:18, 36.0MB/s]
 11%|#         | 74.0M/688M [00:02<00:15, 42.2MB/s]
 12%|#1        | 80.0M/688M [00:02<00:13, 47.2MB/s]
 12%|#2        | 86.0M/688M [00:02<00:12, 50.1MB/s]
 13%|#3        | 92.0M/688M [00:02<00:11, 52.9MB/s]
 14%|#4        | 98.0M/688M [00:02<00:11, 54.4MB/s]
 15%|#5        | 104M/688M [00:02<00:11, 55.5MB/s] 
 16%|#5        | 110M/688M [00:02<00:10, 56.4MB/s]
 17%|#6        | 116M/688M [00:03<00:10, 57.4MB/s]
 18%|#7        | 122M/688M [00:03<00:10, 57.6MB/s]
 19%|#8        | 128M/688M [00:03<00:10, 57.8MB/s]
 19%|#9        | 134M/688M [00:03<00:10, 57.8MB/s]
 20%|##        | 140M/688M [00:03<00:09, 57.8MB/s]
 21%|##1       | 146M/688M [00:03<00:09, 58.3MB/s]
 22%|##2       | 152M/688M [00:03<00:09, 58.6MB/s]
 23%|##2       | 158M/688M [00:03<00:09, 57.8MB/s]
 24%|##3       | 164M/688M [00:03<00:09, 58.4MB/s]
 25%|##4       | 170M/688M [00:04<00:09, 58.6MB/s]
 26%|##5       | 176M/688M [00:04<00:09, 58.0MB/s]
 26%|##6       | 182M/688M [00:04<00:09, 58.3MB/s]
 27%|##7       | 188M/688M [00:04<00:08, 58.4MB/s]
 28%|##8       | 194M/688M [00:04<00:08, 58.7MB/s]
 29%|##9       | 200M/688M [00:04<00:08, 57.9MB/s]
 30%|##9       | 206M/688M [00:04<00:08, 57.1MB/s]
 31%|###       | 212M/688M [00:04<00:08, 57.9MB/s]
 32%|###1      | 218M/688M [00:04<00:08, 58.1MB/s]
 33%|###2      | 224M/688M [00:05<00:08, 58.4MB/s]
 33%|###3      | 230M/688M [00:05<00:08, 58.6MB/s]
 34%|###4      | 236M/688M [00:05<00:11, 41.0MB/s]
 35%|###5      | 242M/688M [00:05<00:10, 45.3MB/s]
 36%|###6      | 248M/688M [00:05<00:09, 48.6MB/s]
 37%|###6      | 254M/688M [00:05<00:08, 51.3MB/s]
 38%|###7      | 260M/688M [00:05<00:08, 53.4MB/s]
 39%|###8      | 266M/688M [00:05<00:08, 54.7MB/s]
 40%|###9      | 272M/688M [00:06<00:07, 55.8MB/s]
 40%|####      | 278M/688M [00:06<00:07, 56.5MB/s]
 41%|####1     | 284M/688M [00:06<00:07, 57.0MB/s]
 42%|####2     | 290M/688M [00:06<00:07, 57.4MB/s]
 43%|####3     | 296M/688M [00:06<00:07, 58.1MB/s]
 44%|####3     | 302M/688M [00:06<00:06, 58.1MB/s]
 45%|####4     | 308M/688M [00:06<00:06, 58.4MB/s]
 46%|####5     | 314M/688M [00:06<00:06, 58.5MB/s]
 46%|####6     | 320M/688M [00:06<00:06, 58.3MB/s]
 47%|####7     | 326M/688M [00:07<00:06, 58.1MB/s]
 48%|####8     | 332M/688M [00:07<00:06, 58.4MB/s]
 49%|####9     | 338M/688M [00:07<00:06, 58.7MB/s]
 50%|####9     | 344M/688M [00:07<00:06, 58.2MB/s]
 51%|#####     | 350M/688M [00:07<00:06, 58.7MB/s]
 52%|#####1    | 356M/688M [00:07<00:06, 55.5MB/s]
 53%|#####2    | 362M/688M [00:07<00:06, 56.4MB/s]
 53%|#####3    | 368M/688M [00:07<00:05, 57.1MB/s]
 54%|#####4    | 374M/688M [00:07<00:05, 57.2MB/s]
 55%|#####5    | 380M/688M [00:08<00:05, 56.6MB/s]
 56%|#####6    | 387M/688M [00:08<00:05, 58.2MB/s]
 57%|#####7    | 393M/688M [00:08<00:05, 57.9MB/s]
 58%|#####7    | 399M/688M [00:08<00:05, 58.1MB/s]
 59%|#####8    | 405M/688M [00:08<00:05, 59.0MB/s]
 60%|#####9    | 411M/688M [00:08<00:04, 58.2MB/s]
 61%|######    | 417M/688M [00:08<00:04, 58.4MB/s]
 61%|######1   | 423M/688M [00:08<00:04, 58.4MB/s]
 62%|######2   | 429M/688M [00:08<00:04, 58.5MB/s]
 63%|######3   | 435M/688M [00:09<00:04, 58.0MB/s]
 64%|######4   | 441M/688M [00:09<00:04, 58.9MB/s]
 65%|######4   | 447M/688M [00:09<00:04, 58.2MB/s]
 66%|######5   | 453M/688M [00:09<00:04, 58.8MB/s]
 67%|######6   | 459M/688M [00:09<00:04, 58.4MB/s]
 68%|######7   | 465M/688M [00:09<00:03, 58.8MB/s]
 68%|######8   | 471M/688M [00:09<00:03, 58.7MB/s]
 69%|######9   | 477M/688M [00:09<00:03, 58.6MB/s]
 70%|#######   | 483M/688M [00:09<00:03, 58.9MB/s]
 71%|#######1  | 489M/688M [00:09<00:03, 58.6MB/s]
 72%|#######1  | 495M/688M [00:10<00:03, 58.6MB/s]
 73%|#######2  | 501M/688M [00:10<00:03, 58.4MB/s]
 74%|#######3  | 507M/688M [00:10<00:03, 58.6MB/s]
 75%|#######4  | 513M/688M [00:10<00:03, 58.5MB/s]
 75%|#######5  | 519M/688M [00:10<00:04, 41.3MB/s]
 76%|#######6  | 525M/688M [00:10<00:03, 45.3MB/s]
 77%|#######7  | 531M/688M [00:10<00:03, 48.6MB/s]
 78%|#######8  | 537M/688M [00:10<00:03, 51.1MB/s]
 79%|#######8  | 543M/688M [00:11<00:02, 52.9MB/s]
 80%|#######9  | 549M/688M [00:11<00:02, 54.6MB/s]
 81%|########  | 555M/688M [00:11<00:02, 55.6MB/s]
 82%|########1 | 561M/688M [00:11<00:02, 56.9MB/s]
 82%|########2 | 567M/688M [00:11<00:02, 56.8MB/s]
 83%|########3 | 573M/688M [00:11<00:02, 57.8MB/s]
 84%|########4 | 579M/688M [00:11<00:01, 57.7MB/s]
 85%|########5 | 585M/688M [00:11<00:01, 58.0MB/s]
 86%|########5 | 591M/688M [00:11<00:01, 57.4MB/s]
 87%|########6 | 597M/688M [00:12<00:01, 58.5MB/s]
 88%|########7 | 603M/688M [00:12<00:01, 58.6MB/s]
 88%|########8 | 609M/688M [00:12<00:01, 58.7MB/s]
 89%|########9 | 615M/688M [00:12<00:01, 58.1MB/s]
 90%|######### | 621M/688M [00:12<00:01, 58.2MB/s]
 91%|#########1| 627M/688M [00:12<00:01, 58.9MB/s]
 92%|#########1| 633M/688M [00:12<00:00, 58.7MB/s]
 93%|#########2| 639M/688M [00:12<00:00, 58.7MB/s]
 94%|#########3| 645M/688M [00:12<00:00, 58.5MB/s]
 95%|#########4| 651M/688M [00:13<00:00, 58.1MB/s]
 95%|#########5| 657M/688M [00:13<00:00, 58.0MB/s]
 96%|#########6| 663M/688M [00:13<00:00, 58.2MB/s]
 97%|#########7| 669M/688M [00:13<00:00, 58.0MB/s]
 98%|#########8| 675M/688M [00:13<00:00, 58.1MB/s]
 99%|#########8| 681M/688M [00:13<00:00, 58.3MB/s]
100%|#########9| 687M/688M [00:13<00:00, 58.3MB/s]
100%|##########| 688M/688M [00:13<00:00, 52.7MB/s]
In [14]:
!dir $DATA_DIR
 Volume in drive C is Windows-SSD
 Volume Serial Number is F43F-BE58

 Directory of C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2

12/03/2023  06:58 PM    <DIR>          .
11/30/2023  07:04 PM    <DIR>          ..
11/30/2023  07:04 PM         3,182,122 HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb
12/11/2019  03:03 AM       721,616,255 home-credit-default-risk.zip
11/30/2023  07:04 PM            66,899 home_credit.png
11/30/2023  07:04 PM                11 Phase2.md
11/30/2023  07:04 PM         1,368,981 submission.csv
11/30/2023  07:04 PM         1,091,396 submission.png
               6 File(s)    727,325,664 bytes
               2 Dir(s)  884,199,927,808 bytes free
In [15]:
DATA_DIR
Out[15]:
'C:\\Users\\tanub\\Courses\\AML526\\I526_AML_Student\\Assignments\\Unit-Project-Home-Credit-Default-Risk\\Phase2'

Imports¶

In [16]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
In [17]:
unzippingReq = True #True
if unzippingReq: #please modify this code 
    zip_ref = zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r')
    # extractall():  Extract all members from the archive to the current working directory. path specifies a different directory to extract to
    zip_ref.extractall(DATA_DIR) 
    zip_ref.close()

Data files overview¶

Data Dictionary¶

As part of the data download comes a Data Dictionary. It named HomeCredit_columns_description.csv

image.png

The main Dataset - Application train¶

In [18]:
DATA_DIR = r"C:\Users\tanub\Courses\AML526\I526_AML_Student\Assignments\Unit-Project-Home-Credit-Default-Risk\Phase2\DATA_DIR"
In [19]:
DATA_DIR
Out[19]:
'C:\\Users\\tanub\\Courses\\AML526\\I526_AML_Student\\Assignments\\Unit-Project-Home-Credit-Default-Risk\\Phase2\\DATA_DIR'
In [274]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df

datasets={}  # lets store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'

datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

datasets['application_train'].shape
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

Out[274]:
(307511, 122)

The main Dataset - Application test¶

  • application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
In [275]:
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets and Raw Features¶

  • bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
In [276]:
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
            "previous_application","POS_CASH_balance")

for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 ... 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 ... 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 ... 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 ... 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 ... 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0

5 rows × 23 columns

installments_payments: shape is (13605401, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.360 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040 2160.585
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
CPU times: total: 18.2 s
Wall time: 22.4 s

Dictionary for all Datasets and their sizes¶

In [277]:
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
    
freshdata = datasets
dataset application_train       : [    307,511, 122]
dataset application_test        : [     48,744, 121]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [ 13,605,401, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]

Exploratory Data Analysis¶

EDA for Application train and test¶

Apllication train info¶

In [10]:
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

Numerical Feaures of Application train¶

In [11]:
datasets["application_train"].describe() #numerical only features
Out[11]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511.000000 3.075110e+05 3.075110e+05 307499.000000 3.072330e+05 307511.000000 307511.000000 307511.000000 ... 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
mean 278180.518577 0.080729 0.417052 1.687979e+05 5.990260e+05 27108.573909 5.383962e+05 0.020868 -16036.995067 63815.045904 ... 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 0.722121 2.371231e+05 4.024908e+05 14493.737315 3.694465e+05 0.013831 4363.988632 141275.766519 ... 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 0.000000 2.565000e+04 4.500000e+04 1615.500000 4.050000e+04 0.000290 -25229.000000 -17912.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 0.000000 1.125000e+05 2.700000e+05 16524.000000 2.385000e+05 0.010006 -19682.000000 -2760.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 0.000000 1.471500e+05 5.135310e+05 24903.000000 4.500000e+05 0.018850 -15750.000000 -1213.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 1.000000 2.025000e+05 8.086500e+05 34596.000000 6.795000e+05 0.028663 -12413.000000 -289.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 19.000000 1.170000e+08 4.050000e+06 258025.500000 4.050000e+06 0.072508 -7489.000000 365243.000000 ... 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000

8 rows × 106 columns

Numerical Features of Application test¶

In [12]:
datasets["application_test"].describe() #numerical only features
Out[12]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 48744.000000 48744.000000 4.874400e+04 4.874400e+04 48720.000000 4.874400e+04 48744.000000 48744.000000 48744.000000 48744.000000 ... 48744.000000 48744.0 48744.0 48744.0 42695.000000 42695.000000 42695.000000 42695.000000 42695.000000 42695.000000
mean 277796.676350 0.397054 1.784318e+05 5.167404e+05 29426.240209 4.626188e+05 0.021226 -16068.084605 67485.366322 -4967.652716 ... 0.001559 0.0 0.0 0.0 0.002108 0.001803 0.002787 0.009299 0.546902 1.983769
std 103169.547296 0.709047 1.015226e+05 3.653970e+05 16016.368315 3.367102e+05 0.014428 4325.900393 144348.507136 3552.612035 ... 0.039456 0.0 0.0 0.0 0.046373 0.046132 0.054037 0.110924 0.693305 1.838873
min 100001.000000 0.000000 2.694150e+04 4.500000e+04 2295.000000 4.500000e+04 0.000253 -25195.000000 -17463.000000 -23722.000000 ... 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 188557.750000 0.000000 1.125000e+05 2.606400e+05 17973.000000 2.250000e+05 0.010006 -19637.000000 -2910.000000 -7459.250000 ... 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 277549.000000 0.000000 1.575000e+05 4.500000e+05 26199.000000 3.960000e+05 0.018850 -15785.000000 -1293.000000 -4490.000000 ... 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000
75% 367555.500000 1.000000 2.250000e+05 6.750000e+05 37390.500000 6.300000e+05 0.028663 -12496.000000 -296.000000 -1901.000000 ... 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 1.000000 3.000000
max 456250.000000 20.000000 4.410000e+06 2.245500e+06 180576.000000 2.245500e+06 0.072508 -7338.000000 365243.000000 0.000000 ... 1.000000 0.0 0.0 0.0 2.000000 2.000000 2.000000 6.000000 7.000000 17.000000

8 rows × 105 columns

All Numerical and categorical features of Application Train¶

In [13]:
datasets["application_train"].describe(include='all') #look at all categorical and numerical
Out[13]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.000000 307511.000000 307511 307511 307511 307511 307511.000000 3.075110e+05 3.075110e+05 307499.000000 ... 307511.000000 307511.000000 307511.000000 307511.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000 265992.000000
unique NaN NaN 2 3 2 2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN Cash loans F N Y NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN 278232 202448 202924 213312 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 278180.518577 0.080729 NaN NaN NaN NaN 0.417052 1.687979e+05 5.990260e+05 27108.573909 ... 0.008130 0.000595 0.000507 0.000335 0.006402 0.007000 0.034362 0.267395 0.265474 1.899974
std 102790.175348 0.272419 NaN NaN NaN NaN 0.722121 2.371231e+05 4.024908e+05 14493.737315 ... 0.089798 0.024387 0.022518 0.018299 0.083849 0.110757 0.204685 0.916002 0.794056 1.869295
min 100002.000000 0.000000 NaN NaN NaN NaN 0.000000 2.565000e+04 4.500000e+04 1615.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 189145.500000 0.000000 NaN NaN NaN NaN 0.000000 1.125000e+05 2.700000e+05 16524.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 278202.000000 0.000000 NaN NaN NaN NaN 0.000000 1.471500e+05 5.135310e+05 24903.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 367142.500000 0.000000 NaN NaN NaN NaN 1.000000 2.025000e+05 8.086500e+05 34596.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000
max 456255.000000 1.000000 NaN NaN NaN NaN 19.000000 1.170000e+08 4.050000e+06 258025.500000 ... 1.000000 1.000000 1.000000 1.000000 4.000000 9.000000 8.000000 27.000000 261.000000 25.000000

11 rows × 122 columns

Percent and Number of Missing data for application train¶

In [14]:
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
Out[14]:
Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
LIVINGAPARTMENTS_MEDI 68.35 210199
FLOORSMIN_AVG 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_MEDI 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_MODE 66.50 204488
YEARS_BUILD_AVG 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MEDI 59.38 182590
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590

Percent and Number of Missing data for application test¶

In [16]:
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_train_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_train_data.head(20)
Out[16]:
Percent Test Missing Count
COMMONAREA_AVG 68.72 33495
COMMONAREA_MODE 68.72 33495
COMMONAREA_MEDI 68.72 33495
NONLIVINGAPARTMENTS_AVG 68.41 33347
NONLIVINGAPARTMENTS_MODE 68.41 33347
NONLIVINGAPARTMENTS_MEDI 68.41 33347
FONDKAPREMONT_MODE 67.28 32797
LIVINGAPARTMENTS_AVG 67.25 32780
LIVINGAPARTMENTS_MODE 67.25 32780
LIVINGAPARTMENTS_MEDI 67.25 32780
FLOORSMIN_MEDI 66.61 32466
FLOORSMIN_AVG 66.61 32466
FLOORSMIN_MODE 66.61 32466
OWN_CAR_AGE 66.29 32312
YEARS_BUILD_AVG 65.28 31818
YEARS_BUILD_MEDI 65.28 31818
YEARS_BUILD_MODE 65.28 31818
LANDAREA_MEDI 57.96 28254
LANDAREA_AVG 57.96 28254
LANDAREA_MODE 57.96 28254

Summary of Missing Data in application train¶

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        # Plotting the missing data heatmap
        plt.figure(figsize=(12, 8))
        sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
        plt.title(f'Missing Data Heatmap for {df_name}', fontsize=16)
        plt.show()

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(missing_data["Percent"], missing_data.index, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["application_train"], "application_train")
display_feature_info(datasets["application_train"], "application_train")
--------------------------------------------------------------------------------
                    application_train                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    SK_ID_CURR                    int64  
 1    TARGET                        int64  
 2    NAME_CONTRACT_TYPE            object 
 3    CODE_GENDER                   object 
 4    FLAG_OWN_CAR                  object 
 5    FLAG_OWN_REALTY               object 
 6    CNT_CHILDREN                  int64  
 7    AMT_INCOME_TOTAL              float64
 8    AMT_CREDIT                    float64
 9    AMT_ANNUITY                   float64
 10   AMT_GOODS_PRICE               float64
 11   NAME_TYPE_SUITE               object 
 12   NAME_INCOME_TYPE              object 
 13   NAME_EDUCATION_TYPE           object 
 14   NAME_FAMILY_STATUS            object 
 15   NAME_HOUSING_TYPE             object 
 16   REGION_POPULATION_RELATIVE    float64
 17   DAYS_BIRTH                    int64  
 18   DAYS_EMPLOYED                 int64  
 19   DAYS_REGISTRATION             float64
 20   DAYS_ID_PUBLISH               int64  
 21   OWN_CAR_AGE                   float64
 22   FLAG_MOBIL                    int64  
 23   FLAG_EMP_PHONE                int64  
 24   FLAG_WORK_PHONE               int64  
 25   FLAG_CONT_MOBILE              int64  
 26   FLAG_PHONE                    int64  
 27   FLAG_EMAIL                    int64  
 28   OCCUPATION_TYPE               object 
 29   CNT_FAM_MEMBERS               float64
 30   REGION_RATING_CLIENT          int64  
 31   REGION_RATING_CLIENT_W_CITY   int64  
 32   WEEKDAY_APPR_PROCESS_START    object 
 33   HOUR_APPR_PROCESS_START       int64  
 34   REG_REGION_NOT_LIVE_REGION    int64  
 35   REG_REGION_NOT_WORK_REGION    int64  
 36   LIVE_REGION_NOT_WORK_REGION   int64  
 37   REG_CITY_NOT_LIVE_CITY        int64  
 38   REG_CITY_NOT_WORK_CITY        int64  
 39   LIVE_CITY_NOT_WORK_CITY       int64  
 40   ORGANIZATION_TYPE             object 
 41   EXT_SOURCE_1                  float64
 42   EXT_SOURCE_2                  float64
 43   EXT_SOURCE_3                  float64
 44   APARTMENTS_AVG                float64
 45   BASEMENTAREA_AVG              float64
 46   YEARS_BEGINEXPLUATATION_AVG   float64
 47   YEARS_BUILD_AVG               float64
 48   COMMONAREA_AVG                float64
 49   ELEVATORS_AVG                 float64
 50   ENTRANCES_AVG                 float64
 51   FLOORSMAX_AVG                 float64
 52   FLOORSMIN_AVG                 float64
 53   LANDAREA_AVG                  float64
 54   LIVINGAPARTMENTS_AVG          float64
 55   LIVINGAREA_AVG                float64
 56   NONLIVINGAPARTMENTS_AVG       float64
 57   NONLIVINGAREA_AVG             float64
 58   APARTMENTS_MODE               float64
 59   BASEMENTAREA_MODE             float64
 60   YEARS_BEGINEXPLUATATION_MODE  float64
 61   YEARS_BUILD_MODE              float64
 62   COMMONAREA_MODE               float64
 63   ELEVATORS_MODE                float64
 64   ENTRANCES_MODE                float64
 65   FLOORSMAX_MODE                float64
 66   FLOORSMIN_MODE                float64
 67   LANDAREA_MODE                 float64
 68   LIVINGAPARTMENTS_MODE         float64
 69   LIVINGAREA_MODE               float64
 70   NONLIVINGAPARTMENTS_MODE      float64
 71   NONLIVINGAREA_MODE            float64
 72   APARTMENTS_MEDI               float64
 73   BASEMENTAREA_MEDI             float64
 74   YEARS_BEGINEXPLUATATION_MEDI  float64
 75   YEARS_BUILD_MEDI              float64
 76   COMMONAREA_MEDI               float64
 77   ELEVATORS_MEDI                float64
 78   ENTRANCES_MEDI                float64
 79   FLOORSMAX_MEDI                float64
 80   FLOORSMIN_MEDI                float64
 81   LANDAREA_MEDI                 float64
 82   LIVINGAPARTMENTS_MEDI         float64
 83   LIVINGAREA_MEDI               float64
 84   NONLIVINGAPARTMENTS_MEDI      float64
 85   NONLIVINGAREA_MEDI            float64
 86   FONDKAPREMONT_MODE            object 
 87   HOUSETYPE_MODE                object 
 88   TOTALAREA_MODE                float64
 89   WALLSMATERIAL_MODE            object 
 90   EMERGENCYSTATE_MODE           object 
 91   OBS_30_CNT_SOCIAL_CIRCLE      float64
 92   DEF_30_CNT_SOCIAL_CIRCLE      float64
 93   OBS_60_CNT_SOCIAL_CIRCLE      float64
 94   DEF_60_CNT_SOCIAL_CIRCLE      float64
 95   DAYS_LAST_PHONE_CHANGE        float64
 96   FLAG_DOCUMENT_2               int64  
 97   FLAG_DOCUMENT_3               int64  
 98   FLAG_DOCUMENT_4               int64  
 99   FLAG_DOCUMENT_5               int64  
 100  FLAG_DOCUMENT_6               int64  
 101  FLAG_DOCUMENT_7               int64  
 102  FLAG_DOCUMENT_8               int64  
 103  FLAG_DOCUMENT_9               int64  
 104  FLAG_DOCUMENT_10              int64  
 105  FLAG_DOCUMENT_11              int64  
 106  FLAG_DOCUMENT_12              int64  
 107  FLAG_DOCUMENT_13              int64  
 108  FLAG_DOCUMENT_14              int64  
 109  FLAG_DOCUMENT_15              int64  
 110  FLAG_DOCUMENT_16              int64  
 111  FLAG_DOCUMENT_17              int64  
 112  FLAG_DOCUMENT_18              int64  
 113  FLAG_DOCUMENT_19              int64  
 114  FLAG_DOCUMENT_20              int64  
 115  FLAG_DOCUMENT_21              int64  
 116  AMT_REQ_CREDIT_BUREAU_HOUR    float64
 117  AMT_REQ_CREDIT_BUREAU_DAY     float64
 118  AMT_REQ_CREDIT_BUREAU_WEEK    float64
 119  AMT_REQ_CREDIT_BUREAU_MON     float64
 120  AMT_REQ_CREDIT_BUREAU_QRT     float64
 121  AMT_REQ_CREDIT_BUREAU_YEAR    float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
---------------------------------------------------------------------------
Shape of the df application_train is (307511, 122) 

---------------------------------------------------------------------------
Statistical summary of application_train is :
---------------------------------------------------------------------------
Description of the df application_train:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for application_train:

---------------------------------------------------------------------------
Data type value counts: 
 float64    65
int64      41
object     16
Name: count, dtype: int64

Return the number of unique elements in the object. 

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of application_train.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
       'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
       'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
       'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
       'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
       'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
       'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
       'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
       'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
LIVINGAPARTMENTS_MEDI 68.35 210199
FLOORSMIN_AVG 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_MEDI 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_MODE 66.50 204488
YEARS_BUILD_AVG 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MEDI 59.38 182590
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590
BASEMENTAREA_MEDI 58.52 179943
BASEMENTAREA_AVG 58.52 179943
BASEMENTAREA_MODE 58.52 179943
EXT_SOURCE_1 56.38 173378
NONLIVINGAREA_MODE 55.18 169682
NONLIVINGAREA_AVG 55.18 169682
NONLIVINGAREA_MEDI 55.18 169682
ELEVATORS_MEDI 53.30 163891
ELEVATORS_AVG 53.30 163891
ELEVATORS_MODE 53.30 163891
WALLSMATERIAL_MODE 50.84 156341
APARTMENTS_MEDI 50.75 156061
APARTMENTS_AVG 50.75 156061
APARTMENTS_MODE 50.75 156061
ENTRANCES_MEDI 50.35 154828
ENTRANCES_AVG 50.35 154828
ENTRANCES_MODE 50.35 154828
LIVINGAREA_AVG 50.19 154350
LIVINGAREA_MODE 50.19 154350
LIVINGAREA_MEDI 50.19 154350
HOUSETYPE_MODE 50.18 154297
FLOORSMAX_MODE 49.76 153020
FLOORSMAX_MEDI 49.76 153020
FLOORSMAX_AVG 49.76 153020
YEARS_BEGINEXPLUATATION_MODE 48.78 150007
YEARS_BEGINEXPLUATATION_MEDI 48.78 150007
YEARS_BEGINEXPLUATATION_AVG 48.78 150007
TOTALAREA_MODE 48.27 148431
EMERGENCYSTATE_MODE 47.40 145755
OCCUPATION_TYPE 31.35 96391
EXT_SOURCE_3 19.83 60965
AMT_REQ_CREDIT_BUREAU_HOUR 13.50 41519
AMT_REQ_CREDIT_BUREAU_DAY 13.50 41519
AMT_REQ_CREDIT_BUREAU_WEEK 13.50 41519
AMT_REQ_CREDIT_BUREAU_MON 13.50 41519
AMT_REQ_CREDIT_BUREAU_QRT 13.50 41519
AMT_REQ_CREDIT_BUREAU_YEAR 13.50 41519
NAME_TYPE_SUITE 0.42 1292
OBS_30_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_30_CNT_SOCIAL_CIRCLE 0.33 1021
OBS_60_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_60_CNT_SOCIAL_CIRCLE 0.33 1021
EXT_SOURCE_2 0.21 660
AMT_GOODS_PRICE 0.09 278
---------------------------------------------------------------------------

Finding 1¶

Anomalies are apparent in the dataset based on descriptive statistics for features like Days Birth, Days Employed, Days Registration, and Days ID Publish, where negative values are present, which is unexpected.
The maximum value for Own Car Age is 91.
Certain features related to living space and realty appear to be redundant, and considering their removal during the feature reduction process would be beneficial to mitigate potential issues with multicollinearity.

Exploring DaysEmployed, Days Birth, Days Registration, Days ID Publish, Own Car Age¶

In [25]:
pip install seaborn --upgrade
Requirement already satisfied: seaborn in /usr/local/lib/python3.9/site-packages (0.11.2)
Collecting seaborn
  Downloading seaborn-0.13.0-py3-none-any.whl (294 kB)
     |████████████████████████████████| 294 kB 1.2 MB/s            
Requirement already satisfied: matplotlib!=3.6.1,>=3.3 in /usr/local/lib/python3.9/site-packages (from seaborn) (3.4.3)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.9/site-packages (from seaborn) (1.26.2)
Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.9/site-packages (from seaborn) (2.1.3)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.3->seaborn) (0.11.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.3->seaborn) (3.0.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.3->seaborn) (1.3.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.3->seaborn) (9.0.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/site-packages (from matplotlib!=3.6.1,>=3.3->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/site-packages (from pandas>=1.2->seaborn) (2021.3)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.9/site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.3->seaborn) (1.15.0)
Installing collected packages: seaborn
  Attempting uninstall: seaborn
    Found existing installation: seaborn 0.11.2
    Uninstalling seaborn-0.11.2:
      Successfully uninstalled seaborn-0.11.2
Successfully installed seaborn-0.13.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.3.1; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
In [11]:
df_train = datasets["application_train"]

Distribution of the Negative Values¶

In [13]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set up a function to handle NaN and infinite values for histograms
def plot_histogram(feature, xlabel, title, color, xlim=None):
    plt.figure(figsize=(10, 6))
    sns.histplot(df_train[feature].replace([np.inf, -np.inf], np.nan).dropna(), kde=False, bins=30, color=color)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Frequency')
    
    if xlim:
        plt.xlim(xlim)
    
    plt.show()

# OWN_CAR_AGE: Histogram for negative values
plot_histogram('OWN_CAR_AGE', 'Own Car Age', 'Distribution of Own Car Age (Negative Values)', 'skyblue', xlim=(df_train['OWN_CAR_AGE'].min(), 0))

# DAYS_BIRTH: Age Distribution
plot_histogram('DAYS_BIRTH', 'Age (years)', 'Distribution of Age', 'salmon')

# DAYS_EMPLOYED: Employment Duration Distribution
plot_histogram('DAYS_EMPLOYED', 'Employment Duration (years)', 'Distribution of Employment Duration', 'lightgreen', xlim=(df_train['DAYS_EMPLOYED'].min(), 0))

# DAYS_REGISTRATION: Days Since Registration Distribution
plot_histogram('DAYS_REGISTRATION', 'Days Since Registration (years)', 'Distribution of Days Since Registration', 'orange', xlim=(df_train['DAYS_REGISTRATION'].min(), 0))

Finding 2¶

The training dataset, referred to as "Application Train," contains extensive information about submitted loan requests.
However, the presence of missing values is a notable concern within this dataset. Particularly, Occupation Type and Organization Type are categorical variables with 58 and 18 categories, respectively.
These categorical features hold the potential for valuable insights in the process of feature engineering.

Distribution of the target column¶

In [15]:
plt.figure(figsize=(10,5))
sns.set_theme()
sns.countplot(x = 'TARGET',data = df_train)
plt.xlabel("Target",fontweight='bold',size=13)
plt.ylabel("Count",fontweight='bold',size=13)
plt.show()

Correlation analysis with the target column¶

In [21]:
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations:
 FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64

Age of the Applicants¶

In [22]:
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');

Occupations of applicants¶

In [22]:
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
In [102]:
df_train = datasets["application_train"]

Looking at Categorical Attributes in Application train¶

In [25]:
import seaborn as sns

# IGNORE Warnings
import warnings
warnings.filterwarnings("ignore")

categorical_attributes = ['TARGET', 'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
                          'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']

fig, axes = plt.subplots(2, 4, figsize=(30, 20))
plt.subplots_adjust(left=None, bottom=None, right=None,
                    top=None, wspace=None, hspace=0.45)

plot_number = 0
for i in range(0, 2):
    for j in range(0, 4):
        current_plot = sns.countplot(x=categorical_attributes[plot_number],
                                    data=df_train, hue='TARGET', ax=axes[i][j])
        current_plot.set_title(f"Distribution of the {categorical_attributes[plot_number]} Variable")
        current_plot.set_xticklabels(current_plot.get_xticklabels(), rotation=25)
        plot_number += 1

Important Categorical Features¶

In [17]:
#Important Categorical Features
sns.set(style="darkgrid") 
fig, axs = plt.subplots(2, 2, figsize=(10, 8)) 

def add_data_labels(ax): 
    total = float(len(df_train)) 
    for p in ax.patches: 
        count = p.get_height() 
        percentage = '{:.1f}%'.format(100 * count / total) 
        ax.annotate(f'{count} ({percentage})', (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points') 

sns.histplot(data=df_train, x="NAME_CONTRACT_TYPE", ax=axs[0, 0], color='green') 
add_data_labels(axs[0, 0]) 
sns.histplot(data=df_train, x="CODE_GENDER", ax=axs[0, 1], color='red') 
add_data_labels(axs[0, 1]) 
sns.histplot(data=df_train, x="FLAG_OWN_CAR", ax=axs[1, 0], color='blue') 
add_data_labels(axs[1, 0]) 
sns.histplot(data=df_train, x="FLAG_OWN_REALTY", ax=axs[1, 1], color='yellow') 
add_data_labels(axs[1, 1]) 
plt.show()

looking at Numerical Attributes in application train¶

In [31]:
run_analysis = True  # Set this to True if you would like to run, else set it to False
if run_analysis:
    numerical_attributes = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
                            'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
    
    df_subset = df_train[numerical_attributes]
    df_subset['TARGET'].replace({0: "No Default", 1: "Default"}, inplace=True)
    df_subset.fillna(0, inplace=True)
    
    sns.pairplot(df_subset, hue="TARGET")

Most positive and negative Corelations¶

In [22]:
# Exclude non-numeric columns from correlation calculation
numeric_columns = df_train.select_dtypes(include=[np.number]).columns.tolist()
correlations = df_train[numeric_columns].corr()["TARGET"].sort_values()

print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations:
 FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64

Correlation Matrix for a few numerical variables¶

In [24]:
#Correlation Matrix for a few numerical variables
correlation_data = df_train[['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']] 
correlation_matrix = correlation_data.corr()

plt.figure(figsize=(16, 14)) 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1) 
plt.title('Correlation Plot')
plt.show()

Corelation Matrix for Numerical Attributes¶

In [26]:
numerical_attributes = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
                        'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']

df_numerical = df_train[numerical_attributes]
correlation_matrix = df_numerical.corr()
correlation_matrix.style.background_gradient(cmap='coolwarm').set_precision(2)
Out[26]:
  TARGET AMT_INCOME_TOTAL AMT_CREDIT DAYS_EMPLOYED DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 AMT_GOODS_PRICE
TARGET 1.00 -0.00 -0.03 -0.04 0.08 -0.16 -0.16 -0.18 -0.04
AMT_INCOME_TOTAL -0.00 1.00 0.16 -0.06 0.03 0.03 0.06 -0.03 0.16
AMT_CREDIT -0.03 0.16 1.00 -0.07 -0.06 0.17 0.13 0.04 0.99
DAYS_EMPLOYED -0.04 -0.06 -0.07 1.00 -0.62 0.29 -0.02 0.11 -0.06
DAYS_BIRTH 0.08 0.03 -0.06 -0.62 1.00 -0.60 -0.09 -0.21 -0.05
EXT_SOURCE_1 -0.16 0.03 0.17 0.29 -0.60 1.00 0.21 0.19 0.18
EXT_SOURCE_2 -0.16 0.06 0.13 -0.02 -0.09 0.21 1.00 0.11 0.14
EXT_SOURCE_3 -0.18 -0.03 0.04 0.11 -0.21 0.19 0.11 1.00 0.05
AMT_GOODS_PRICE -0.04 0.16 0.99 -0.06 -0.05 0.18 0.14 0.05 1.00
In [36]:
pip install cufflinks
Collecting cufflinks
  Downloading cufflinks-0.17.3.tar.gz (81 kB)
     |████████████████████████████████| 81 kB 1.1 MB/s            
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy>=1.9.2 in /usr/local/lib/python3.9/site-packages (from cufflinks) (1.22.0)
Requirement already satisfied: pandas>=0.19.2 in /usr/local/lib/python3.9/site-packages (from cufflinks) (1.3.5)
Collecting plotly>=4.1.1
  Downloading plotly-5.18.0-py3-none-any.whl (15.6 MB)
     |████████████████████████████████| 15.6 MB 5.0 MB/s            
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (1.15.0)
Collecting colorlover>=0.2.1
  Downloading colorlover-0.3.0-py3-none-any.whl (8.9 kB)
Requirement already satisfied: setuptools>=34.4.1 in /usr/local/lib/python3.9/site-packages (from cufflinks) (60.5.0)
Requirement already satisfied: ipython>=5.3.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (7.31.0)
Requirement already satisfied: ipywidgets>=7.0.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (7.6.5)
Requirement already satisfied: pygments in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (2.11.2)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.7.5)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.1.3)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (4.8.0)
Requirement already satisfied: jedi>=0.16 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.18.1)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (3.0.24)
Requirement already satisfied: backcall in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: decorator in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (1.0.2)
Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (0.2.0)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (5.1.3)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (6.6.1)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (3.5.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.9/site-packages (from pandas>=0.19.2->cufflinks) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.9/site-packages (from pandas>=0.19.2->cufflinks) (2021.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.9/site-packages (from plotly>=4.1.1->cufflinks) (21.3)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.3-py3-none-any.whl (24 kB)
Requirement already satisfied: debugpy<2.0,>=1.0.0 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.5.1)
Requirement already satisfied: jupyter-client<8.0 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (7.1.0)
Requirement already satisfied: tornado<7.0,>=4.2 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.1)
Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.5.4)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.9/site-packages (from jedi>=0.16->ipython>=5.3.0->cufflinks) (0.8.3)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.9/site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (4.9.1)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.9/site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (4.3.3)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.9/site-packages (from pexpect>4.3->ipython>=5.3.0->cufflinks) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.9/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=5.3.0->cufflinks) (0.2.5)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.9/site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (6.4.6)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.9/site-packages (from packaging->plotly>=4.1.1->cufflinks) (3.0.6)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (0.18.0)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (21.4.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.9/site-packages (from jupyter-client<8.0->ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (22.3.0)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.9/site-packages (from jupyter-client<8.0->ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (0.3)
Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.8.0)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.12.0)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (21.3.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (3.0.3)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (6.4.0)
Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.12.1)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.9/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (21.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.0.1)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.8.4)
Requirement already satisfied: testpath in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.0)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.1.2)
Requirement already satisfied: bleach in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (4.1.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.5.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.7.1)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.9)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.9/site-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.14.6)
Requirement already satisfied: webencodings in /usr/local/lib/python3.9/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.9/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.21)
Building wheels for collected packages: cufflinks
  Building wheel for cufflinks (setup.py) ... done
  Created wheel for cufflinks: filename=cufflinks-0.17.3-py3-none-any.whl size=67918 sha256=00c43ea05d626043a3d24e2f1dfb877eca009674bcef5296e16b0546a3095e9a
  Stored in directory: /root/.cache/pip/wheels/29/b4/f8/2fd2206eeeba6ccad8167e4e8894b8c4ec27bf1342037fd136
Successfully built cufflinks
Installing collected packages: tenacity, plotly, colorlover, cufflinks
Successfully installed colorlover-0.3.0 cufflinks-0.17.3 plotly-5.18.0 tenacity-8.2.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.3.1; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
In [38]:
pip install chart_studio
Collecting chart_studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 369 kB/s             
Requirement already satisfied: plotly in /usr/local/lib/python3.9/site-packages (from chart_studio) (5.18.0)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.9/site-packages (from chart_studio) (2.26.0)
Requirement already satisfied: six in /usr/local/lib/python3.9/site-packages (from chart_studio) (1.15.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.9/site-packages (from plotly->chart_studio) (21.3)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.9/site-packages (from plotly->chart_studio) (8.2.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/site-packages (from requests->chart_studio) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/site-packages (from requests->chart_studio) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/site-packages (from requests->chart_studio) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/site-packages (from requests->chart_studio) (1.26.7)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.9/site-packages (from packaging->plotly->chart_studio) (3.0.6)
Installing collected packages: retrying, chart-studio
Successfully installed chart-studio-1.1.0 retrying-1.3.4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.3.1; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
In [28]:
pip install --upgrade pandas cufflinks
Requirement already satisfied: pandas in /usr/local/lib/python3.9/site-packages (1.3.5)
Collecting pandas
  Downloading pandas-2.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
     |████████████████████████████████| 12.3 MB 1.9 MB/s            
Requirement already satisfied: cufflinks in /usr/local/lib/python3.9/site-packages (0.17.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.9/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/site-packages (from pandas) (2021.3)
Collecting tzdata>=2022.1
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
     |████████████████████████████████| 341 kB 2.7 MB/s            
Collecting numpy<2,>=1.22.4
  Downloading numpy-1.26.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     |████████████████████████████████| 18.2 MB 2.9 MB/s            
Requirement already satisfied: setuptools>=34.4.1 in /usr/local/lib/python3.9/site-packages (from cufflinks) (60.5.0)
Requirement already satisfied: ipython>=5.3.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (7.31.0)
Requirement already satisfied: ipywidgets>=7.0.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (7.6.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.9/site-packages (from cufflinks) (1.15.0)
Requirement already satisfied: colorlover>=0.2.1 in /usr/local/lib/python3.9/site-packages (from cufflinks) (0.3.0)
Requirement already satisfied: plotly>=4.1.1 in /usr/local/lib/python3.9/site-packages (from cufflinks) (5.18.0)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (4.8.0)
Requirement already satisfied: decorator in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: backcall in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.7.5)
Requirement already satisfied: jedi>=0.16 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.18.1)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (3.0.24)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (0.1.3)
Requirement already satisfied: pygments in /usr/local/lib/python3.9/site-packages (from ipython>=5.3.0->cufflinks) (2.11.2)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (6.6.1)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (1.0.2)
Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (0.2.0)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (5.1.3)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.9/site-packages (from ipywidgets>=7.0.0->cufflinks) (3.5.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.9/site-packages (from plotly>=4.1.1->cufflinks) (21.3)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.9/site-packages (from plotly>=4.1.1->cufflinks) (8.2.3)
Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.5.4)
Requirement already satisfied: tornado<7.0,>=4.2 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.1)
Requirement already satisfied: debugpy<2.0,>=1.0.0 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.5.1)
Requirement already satisfied: jupyter-client<8.0 in /usr/local/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (7.1.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.9/site-packages (from jedi>=0.16->ipython>=5.3.0->cufflinks) (0.8.3)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.9/site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (4.9.1)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.9/site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (4.3.3)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.9/site-packages (from pexpect>4.3->ipython>=5.3.0->cufflinks) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.9/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=5.3.0->cufflinks) (0.2.5)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.9/site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (6.4.6)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.9/site-packages (from packaging->plotly>=4.1.1->cufflinks) (3.0.6)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (21.4.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (0.18.0)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.9/site-packages (from jupyter-client<8.0->ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (0.3)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.9/site-packages (from jupyter-client<8.0->ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (22.3.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (3.0.3)
Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.12.1)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (21.3.0)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.12.0)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (6.4.0)
Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.8.0)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.9/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (21.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.0.1)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.5.0)
Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.9)
Requirement already satisfied: testpath in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.0)
Requirement already satisfied: bleach in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (4.1.0)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.8.4)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.7.1)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.9/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.1.2)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.9/site-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.14.6)
Requirement already satisfied: webencodings in /usr/local/lib/python3.9/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.9/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.21)
Installing collected packages: tzdata, numpy, pandas
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.0
    Uninstalling numpy-1.22.0:
      Successfully uninstalled numpy-1.22.0
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.3 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.26.2 which is incompatible.
basemap 1.3.0 requires numpy<1.22,>=1.16; python_version >= "3.5", but you have numpy 1.26.2 which is incompatible.
Successfully installed numpy-1.26.2 pandas-2.1.3 tzdata-2023.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.3.1; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

Distribution among the output class¶

In [103]:
import plotly.express as px

contract_type_counts = df_train['NAME_CONTRACT_TYPE'].value_counts()
contract_type_df = pd.DataFrame({'labels': contract_type_counts.index,
                                  'values': contract_type_counts.values
                                 })

fig = px.pie(contract_type_df, names='labels', values='values', title='Distribution of Loan Types')
fig.update_traces(hole=0.6)

fig.show()

Total Income Amount less than 2000000¶

In [32]:
pip install --upgrade plotly
Requirement already satisfied: plotly in /usr/local/lib/python3.9/site-packages (5.18.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.9/site-packages (from plotly) (21.3)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.9/site-packages (from plotly) (8.2.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.9/site-packages (from packaging->plotly) (3.0.6)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.3.1; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
In [34]:
import matplotlib.pyplot as plt

income_filter = df_train[df_train['AMT_INCOME_TOTAL'] < 2000000]

plt.figure(figsize=(10, 6))
plt.hist(income_filter['AMT_INCOME_TOTAL'], bins=100, color='blue', edgecolor='black')
plt.title('Distribution of Income (Filtered)')
plt.xlabel('Total Income')
plt.ylabel('Count of Applicants')
plt.show()

Distribution of Credit Amount¶

In [35]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(df_train['AMT_CREDIT'], bins=100, color='green', edgecolor='black')
plt.title('Distribution of Credit Amount')
plt.xlabel('Credit Amount')
plt.ylabel('Count of Applicants')
plt.show()

Contract Type with Amount Credit and Code Gender¶

In [25]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df_train is your DataFrame containing the HCDR dataset

# Set a custom color palette for the plot
custom_palette = sns.color_palette("Set2")

# Set up the figure and axes
plt.figure(figsize=(12, 8))

# Box plot for 'AMT_CREDIT' with 'NAME_CONTRACT_TYPE' as huehttp://localhost:8888/notebooks/Courses/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb#Summary-of-the-Applications-Dataset-and-Missing-Data
sns.boxplot(x='NAME_CONTRACT_TYPE', y='AMT_CREDIT', hue='CODE_GENDER', data=df_train, palette=custom_palette)

# Set plot labels and title
plt.xlabel('Contract Type')
plt.ylabel('Credit Amount')
plt.title('Box Plot of Credit Amount by Contract Type and Gender')

# Customize legend
plt.legend(title='Gender')

# Show the plot
plt.show()

EDA for Previous Applications Dataset¶

In [16]:
prevApp = datasets['previous_application']

Summary of the Applications Dataset and Missing Data¶

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["previous_application"], "previous_application")
display_feature_info(datasets["previous_application"], "previous_application")
--------------------------------------------------------------------------------
                    previous_application                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
---------------------------------------------------------------------------
Shape of the df previous_application is (1670214, 37) 

---------------------------------------------------------------------------
Statistical summary of previous_application is :
---------------------------------------------------------------------------
Description of the df previous_application:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for previous_application:

---------------------------------------------------------------------------
Data type value counts: 
 object     16
float64    15
int64       6
Name: count, dtype: int64

Return the number of unique elements in the object. 

NAME_CONTRACT_TYPE              4
WEEKDAY_APPR_PROCESS_START      7
FLAG_LAST_APPL_PER_CONTRACT     2
NAME_CASH_LOAN_PURPOSE         25
NAME_CONTRACT_STATUS            4
NAME_PAYMENT_TYPE               4
CODE_REJECT_REASON              9
NAME_TYPE_SUITE                 7
NAME_CLIENT_TYPE                4
NAME_GOODS_CATEGORY            28
NAME_PORTFOLIO                  5
NAME_PRODUCT_TYPE               3
CHANNEL_TYPE                    8
NAME_SELLER_INDUSTRY           11
NAME_YIELD_GROUP                5
PRODUCT_COMBINATION            17
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of previous_application.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'HOUR_APPR_PROCESS_START',
       'NFLAG_LAST_APPL_IN_DAY', 'DAYS_DECISION', 'SELLERPLACE_AREA'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
       'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING',
       'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE',
       'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
       'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
       'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
       'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
RATE_INTEREST_PRIVILEGED 99.64 1664263
RATE_INTEREST_PRIMARY 99.64 1664263
AMT_DOWN_PAYMENT 53.64 895844
RATE_DOWN_PAYMENT 53.64 895844
NAME_TYPE_SUITE 49.12 820405
NFLAG_INSURED_ON_APPROVAL 40.30 673065
DAYS_TERMINATION 40.30 673065
DAYS_LAST_DUE 40.30 673065
DAYS_LAST_DUE_1ST_VERSION 40.30 673065
DAYS_FIRST_DUE 40.30 673065
DAYS_FIRST_DRAWING 40.30 673065
AMT_GOODS_PRICE 23.08 385515
AMT_ANNUITY 22.29 372235
CNT_PAYMENT 22.29 372230
PRODUCT_COMBINATION 0.02 346
---------------------------------------------------------------------------

Statistical Summary of previous_application¶

In [12]:
display_stats(datasets['previous_application'], 'previous_application')
--------------------------------------------------------------------------------
                    previous_application                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
---------------------------------------------------------------------------
Shape of the df previous_application is (1670214, 37) 

---------------------------------------------------------------------------
Statistical summary of previous_application is :
---------------------------------------------------------------------------
Description of the df previous_application:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None

Finding 3¶

The number of children reaches as high as 19, which suggests a potential outlier that warrants further investigation.
Additionally, the presence of negative values in all the "number of day" fields indicates anomalies in the data.
Nevertheless, certain fields provide information on average years. Conducting a calculation to compare the average years with the corresponding days could yield valuable insights.

Columns in Previous_Applications¶

In [17]:
prevApp.columns
Out[17]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')

Some Categoricl features¶

In [18]:
#Important Categorical Features
sns.set(style="darkgrid") 
fig, axs = plt.subplots(2, 2, figsize=(16, 10)) 

def add_data_labels(ax): 
    total = float(len(prevApp)) 
    for p in ax.patches: 
        count = p.get_height() 
        percentage = '{:.1f}%'.format(100 * count / total) 
        ax.annotate(f'{count} ({percentage})', (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points') 

sns.histplot(data=prevApp, x="NAME_CONTRACT_TYPE", ax=axs[0, 0], color='green') 
add_data_labels(axs[0, 0]) 
sns.histplot(data=prevApp, x="NAME_CONTRACT_STATUS", ax=axs[0, 1], color='red') 
add_data_labels(axs[0, 1]) 
sns.histplot(data=prevApp, x="NAME_YIELD_GROUP", ax=axs[1, 0], color='blue') 
add_data_labels(axs[1, 0]) 
sns.histplot(data=prevApp, x="NAME_PORTFOLIO", ax=axs[1, 1], color='yellow') 
add_data_labels(axs[1, 1]) 
plt.show()

Correlation Plot between important features¶

In [19]:
# Correlation Plot between important features
correlation_data = prevApp[['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'CNT_PAYMENT']]
correlation_matrix = correlation_data.corr() 

plt.figure(figsize=(16, 12)) 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1) 
plt.title('Correlation Plot for Previous Application')
plt.show()

EDA for Bureau¶

Summary of the Bureau Dataset and Missing Data¶

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["bureau"], "bureau")
display_feature_info(datasets["bureau"], "bureau")
--------------------------------------------------------------------------------
                    bureau                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau is (1716428, 17) 

---------------------------------------------------------------------------
Statistical summary of bureau is :
---------------------------------------------------------------------------
Description of the df bureau:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for bureau:

---------------------------------------------------------------------------
Data type value counts: 
 float64    8
int64      6
object     3
Name: count, dtype: int64

Return the number of unique elements in the object. 

CREDIT_ACTIVE       4
CREDIT_CURRENCY     4
CREDIT_TYPE        15
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of bureau.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
       'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE'],
      dtype='object')}
------------------------------
{'float64': Index(['DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
       'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
       'AMT_CREDIT_SUM_OVERDUE', 'AMT_ANNUITY'],
      dtype='object')}
------------------------------
{'object': Index(['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
AMT_ANNUITY 71.47 1226791
AMT_CREDIT_MAX_OVERDUE 65.51 1124488
DAYS_ENDDATE_FACT 36.92 633653
AMT_CREDIT_SUM_LIMIT 34.48 591780
AMT_CREDIT_SUM_DEBT 15.01 257669
DAYS_CREDIT_ENDDATE 6.15 105553
---------------------------------------------------------------------------

Statistical Summary of Bureau Dataset¶

In [22]:
display_stats(datasets['bureau'], 'bureau')
--------------------------------------------------------------------------------
                    bureau                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau is (1716428, 17) 

---------------------------------------------------------------------------
Statistical summary of bureau is :
---------------------------------------------------------------------------
Description of the df bureau:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None

Columns in bureau Datset¶

In [23]:
datasets['bureau'].columns
Out[23]:
Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
       'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
       'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
       'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
       'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
       'AMT_ANNUITY'],
      dtype='object')
In [24]:
datasets['bureau'].describe()
Out[24]:
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE AMT_ANNUITY
count 1.716428e+06 1.716428e+06 1.716428e+06 1.716428e+06 1.610875e+06 1.082775e+06 5.919400e+05 1.716428e+06 1.716415e+06 1.458759e+06 1.124648e+06 1.716428e+06 1.716428e+06 4.896370e+05
mean 2.782149e+05 5.924434e+06 -1.142108e+03 8.181666e-01 5.105174e+02 -1.017437e+03 3.825418e+03 6.410406e-03 3.549946e+05 1.370851e+05 6.229515e+03 3.791276e+01 -5.937483e+02 1.571276e+04
std 1.029386e+05 5.322657e+05 7.951649e+02 3.654443e+01 4.994220e+03 7.140106e+02 2.060316e+05 9.622391e-02 1.149811e+06 6.774011e+05 4.503203e+04 5.937650e+03 7.207473e+02 3.258269e+05
min 1.000010e+05 5.000000e+06 -2.922000e+03 0.000000e+00 -4.206000e+04 -4.202300e+04 0.000000e+00 0.000000e+00 0.000000e+00 -4.705600e+06 -5.864061e+05 0.000000e+00 -4.194700e+04 0.000000e+00
25% 1.888668e+05 5.463954e+06 -1.666000e+03 0.000000e+00 -1.138000e+03 -1.489000e+03 0.000000e+00 0.000000e+00 5.130000e+04 0.000000e+00 0.000000e+00 0.000000e+00 -9.080000e+02 0.000000e+00
50% 2.780550e+05 5.926304e+06 -9.870000e+02 0.000000e+00 -3.300000e+02 -8.970000e+02 0.000000e+00 0.000000e+00 1.255185e+05 0.000000e+00 0.000000e+00 0.000000e+00 -3.950000e+02 0.000000e+00
75% 3.674260e+05 6.385681e+06 -4.740000e+02 0.000000e+00 4.740000e+02 -4.250000e+02 0.000000e+00 0.000000e+00 3.150000e+05 4.015350e+04 0.000000e+00 0.000000e+00 -3.300000e+01 1.350000e+04
max 4.562550e+05 6.843457e+06 0.000000e+00 2.792000e+03 3.119900e+04 0.000000e+00 1.159872e+08 9.000000e+00 5.850000e+08 1.701000e+08 4.705600e+06 3.756681e+06 3.720000e+02 1.184534e+08
In [25]:
datasets['bureau'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB

Important Categorical Features¶

In [26]:
bureau = datasets['bureau']
In [27]:
#Important Categorical Features
sns.set(style="darkgrid") 
fig, ax = plt.subplots(figsize=(10, 8)) 

sns.histplot(data=bureau, x="CREDIT_ACTIVE", ax=ax, color='green') 
add_data_labels(ax) 

plt.show()

important numerical Columns in Bureau¶

In [28]:
numerical_columns = bureau.select_dtypes(include=['float64', 'int64']).columns 
numerical_data = bureau[numerical_columns] 

numerical_data.hist(bins=50, figsize=(20,15))
plt.show()

Corelaton plot for Bureau Dataset¶

In [29]:
# Correlation Plot
correlation_data = bureau[['DAYS_CREDIT','CREDIT_DAY_OVERDUE','DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE']] 
correlation_matrix = correlation_data.corr() 

plt.figure(figsize=(10, 8)) 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1) 
plt.title('Correlation Plot for Bureau')
plt.show()

EDA for Bureau-Balance Dataset¶

Summary of the Bureau-Balance Dataset and Missing Data¶

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["bureau_balance"], "bureau_balance")
display_feature_info(datasets["bureau_balance"], "bureau_balance")
--------------------------------------------------------------------------------
                    bureau_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau_balance is (27299925, 3) 

---------------------------------------------------------------------------
Statistical summary of bureau_balance is :
---------------------------------------------------------------------------
Description of the df bureau_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for bureau_balance:

---------------------------------------------------------------------------
Data type value counts: 
 int64     2
object    1
Name: count, dtype: int64

Return the number of unique elements in the object. 

STATUS    8
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of bureau_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_BUREAU', 'MONTHS_BALANCE'], dtype='object')}
------------------------------
{'object': Index(['STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

No missing Data

Finding 4¶

Bureau balance & bureau have no missing data. These datasets can provide accurate aggreagte features.

In [8]:
bureauBal = datasets["bureau_balance"]
In [8]:
bureauBal.head(5)
Out[8]:
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C

Columns in bureauBal¶

In [9]:
bureauBal.columns
Out[9]:
Index(['SK_ID_BUREAU', 'MONTHS_BALANCE', 'STATUS'], dtype='object')
In [10]:
bureauBal.describe()
Out[10]:
SK_ID_BUREAU MONTHS_BALANCE
count 2.729992e+07 2.729992e+07
mean 6.036297e+06 -3.074169e+01
std 4.923489e+05 2.386451e+01
min 5.001709e+06 -9.600000e+01
25% 5.730933e+06 -4.600000e+01
50% 6.070821e+06 -2.500000e+01
75% 6.431951e+06 -1.100000e+01
max 6.842888e+06 0.000000e+00
In [12]:
bureauBal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB

important Categorical Features in Bureau-balance¶

In [13]:
#Important Categorical Feature
plt.figure(figsize=(10,5))
sns.set_theme()
sns.countplot(x = 'STATUS',data = bureauBal)
plt.xlabel("Status",fontweight='bold',size=13)
plt.ylabel("Count",fontweight='bold',size=13)
plt.show()

Box plot to visualize the relationship between STATUS and Months Balance¶

In [9]:
# Box plot to visualize the relationship between STATUS and Months Balance
plt.figure(figsize=(10, 6))
sns.boxplot(x='STATUS', y='MONTHS_BALANCE', data=bureauBal, order=['0', 'C', '1', '2', '3', '4', '5'])
plt.title('Box Plot of Months Balance by STATUS')
plt.xlabel('STATUS')
plt.ylabel('Months Balance')
plt.show()

EDA on Credit Balance Dataset¶

Summary of the Credit Balance Dataset and Missing Data¶

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["credit_card_balance"], "credit_card_balance")
display_feature_info(datasets["credit_card_balance"], "credit_card_balance")
--------------------------------------------------------------------------------
                    credit_card_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
---------------------------------------------------------------------------
Shape of the df credit_card_balance is (3840312, 23) 

---------------------------------------------------------------------------
Statistical summary of credit_card_balance is :
---------------------------------------------------------------------------
Description of the df credit_card_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for credit_card_balance:

---------------------------------------------------------------------------
Data type value counts: 
 float64    15
int64       7
object      1
Name: count, dtype: int64

Return the number of unique elements in the object. 

NAME_CONTRACT_STATUS    7
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of credit_card_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL',
       'CNT_DRAWINGS_CURRENT', 'SK_DPD', 'SK_DPD_DEF'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_BALANCE', 'AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT',
       'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT',
       'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT',
       'AMT_PAYMENT_TOTAL_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL',
       'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE', 'CNT_DRAWINGS_ATM_CURRENT',
       'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
       'CNT_INSTALMENT_MATURE_CUM'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
AMT_PAYMENT_CURRENT 20.00 767988
AMT_DRAWINGS_ATM_CURRENT 19.52 749816
CNT_DRAWINGS_POS_CURRENT 19.52 749816
AMT_DRAWINGS_OTHER_CURRENT 19.52 749816
AMT_DRAWINGS_POS_CURRENT 19.52 749816
CNT_DRAWINGS_OTHER_CURRENT 19.52 749816
CNT_DRAWINGS_ATM_CURRENT 19.52 749816
CNT_INSTALMENT_MATURE_CUM 7.95 305236
AMT_INST_MIN_REGULARITY 7.95 305236
---------------------------------------------------------------------------
In [10]:
CCBalance = datasets["credit_card_balance"]

Columns in credit_card_balance¶

In [13]:
CCBalance.columns
Out[13]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_BALANCE',
       'AMT_CREDIT_LIMIT_ACTUAL', 'AMT_DRAWINGS_ATM_CURRENT',
       'AMT_DRAWINGS_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT',
       'AMT_DRAWINGS_POS_CURRENT', 'AMT_INST_MIN_REGULARITY',
       'AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT',
       'AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE',
       'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_CURRENT',
       'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
       'CNT_INSTALMENT_MATURE_CUM', 'NAME_CONTRACT_STATUS', 'SK_DPD',
       'SK_DPD_DEF'],
      dtype='object')
In [14]:
CCBalance.describe()
Out[14]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
count 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 ... 3.840312e+06 3.840312e+06 3.840312e+06 3.090496e+06 3.840312e+06 3.090496e+06 3.090496e+06 3.535076e+06 3.840312e+06 3.840312e+06
mean 1.904504e+06 2.783242e+05 -3.452192e+01 5.830016e+04 1.538080e+05 5.961325e+03 7.433388e+03 2.881696e+02 2.968805e+03 3.540204e+03 ... 5.596588e+04 5.808881e+04 5.809829e+04 3.094490e-01 7.031439e-01 4.812496e-03 5.594791e-01 2.082508e+01 9.283667e+00 3.316220e-01
std 5.364695e+05 1.027045e+05 2.666775e+01 1.063070e+05 1.651457e+05 2.822569e+04 3.384608e+04 8.201989e+03 2.079689e+04 5.600154e+03 ... 1.025336e+05 1.059654e+05 1.059718e+05 1.100401e+00 3.190347e+00 8.263861e-02 3.240649e+00 2.005149e+01 9.751570e+01 2.147923e+01
min 1.000018e+06 1.000060e+05 -9.600000e+01 -4.202502e+05 0.000000e+00 -6.827310e+03 -6.211620e+03 0.000000e+00 0.000000e+00 0.000000e+00 ... -4.233058e+05 -4.202502e+05 -4.202502e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.434385e+06 1.895170e+05 -5.500000e+01 0.000000e+00 4.500000e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 0.000000e+00 0.000000e+00
50% 1.897122e+06 2.783960e+05 -2.800000e+01 0.000000e+00 1.125000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 0.000000e+00 0.000000e+00
75% 2.369328e+06 3.675800e+05 -1.100000e+01 8.904669e+04 1.800000e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.633911e+03 ... 8.535924e+04 8.889949e+04 8.891451e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.200000e+01 0.000000e+00 0.000000e+00
max 2.843496e+06 4.562500e+05 -1.000000e+00 1.505902e+06 1.350000e+06 2.115000e+06 2.287098e+06 1.529847e+06 2.239274e+06 2.028820e+05 ... 1.472317e+06 1.493338e+06 1.493338e+06 5.100000e+01 1.650000e+02 1.200000e+01 1.650000e+02 1.200000e+02 3.260000e+03 3.260000e+03

8 rows × 22 columns

In [15]:
CCBalance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB

Testing Contract Status Variable¶

In [16]:
#Testing Contract Status Variable
plt.figure(figsize=(10,5))
sns.set_theme()
sns.countplot(x = 'NAME_CONTRACT_STATUS',data = CCBalance)
plt.xlabel("Contract Status",fontweight='bold',size=13)
plt.ylabel("Number",fontweight='bold',size=13)
plt.show()

Important Numerical Features¶

In [17]:
#Important Numerical Features
sns.set(style="darkgrid")
fig,axs=plt.subplots(2,2,figsize=(10,8))
sns.histplot(data=CCBalance,x="MONTHS_BALANCE",kde=True,ax=axs[0,0],color='green')
sns.histplot(data=CCBalance,x="AMT_BALANCE",kde=True,ax=axs[0,1],color='red')
sns.histplot(data=CCBalance,x="AMT_CREDIT_LIMIT_ACTUAL",kde=True,ax=axs[1,0],color='blue')
sns.histplot(data=CCBalance,x="AMT_DRAWINGS_CURRENT",kde=True,ax=axs[1,1],color='yellow')
Out[17]:
<Axes: xlabel='AMT_DRAWINGS_CURRENT', ylabel='Count'>
In [11]:
numerical_columns = CCBalance.select_dtypes(include=['float64', 'int64']).columns 
numerical_data = CCBalance[numerical_columns] 

numerical_data.hist(bins=50, figsize=(20,15))
plt.show()

Correlation between all vairables¶

In [12]:
#Correlation between all vairables
exclude_variables = ['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_STATUS'] 

selected_variables = [col for col in CCBalance.columns if col not in exclude_variables] 
correlation_data = CCBalance[selected_variables] 
correlation_matrix = correlation_data.corr() 

plt.figure(figsize=(16, 14)) 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1) 
plt.title('Correlation Matrix for Cedit Card Balance') 
plt.show()

EDA on Installments_Payments Dataset¶

Summary of the Installments_Payments Dataset and Missing Data¶

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["installments_payments"], "installments_payments")
display_feature_info(datasets["installments_payments"], "installments_payments")
--------------------------------------------------------------------------------
                    installments_payments                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
---------------------------------------------------------------------------
Shape of the df installments_payments is (13605401, 8) 

---------------------------------------------------------------------------
Statistical summary of installments_payments is :
---------------------------------------------------------------------------
Description of the df installments_payments:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for installments_payments:

---------------------------------------------------------------------------
Data type value counts: 
 float64    5
int64      3
Name: count, dtype: int64

Return the number of unique elements in the object. 

Series([], dtype: float64)
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of installments_payments.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_NUMBER'], dtype='object')}
------------------------------
{'float64': Index(['NUM_INSTALMENT_VERSION', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
       'AMT_INSTALMENT', 'AMT_PAYMENT'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
DAYS_ENTRY_PAYMENT 0.02 2905
AMT_PAYMENT 0.02 2905
---------------------------------------------------------------------------
In [8]:
installPay = datasets["installments_payments"]

Columns in installments_payments Dataset¶

In [15]:
installPay.columns
Out[15]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_VERSION',
       'NUM_INSTALMENT_NUMBER', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
       'AMT_INSTALMENT', 'AMT_PAYMENT'],
      dtype='object')
In [16]:
installPay.describe()
Out[16]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
count 1.360540e+07 1.360540e+07 1.360540e+07 1.360540e+07 1.360540e+07 1.360250e+07 1.360540e+07 1.360250e+07
mean 1.903365e+06 2.784449e+05 8.566373e-01 1.887090e+01 -1.042270e+03 -1.051114e+03 1.705091e+04 1.723822e+04
std 5.362029e+05 1.027183e+05 1.035216e+00 2.666407e+01 8.009463e+02 8.005859e+02 5.057025e+04 5.473578e+04
min 1.000001e+06 1.000010e+05 0.000000e+00 1.000000e+00 -2.922000e+03 -4.921000e+03 0.000000e+00 0.000000e+00
25% 1.434191e+06 1.896390e+05 0.000000e+00 4.000000e+00 -1.654000e+03 -1.662000e+03 4.226085e+03 3.398265e+03
50% 1.896520e+06 2.786850e+05 1.000000e+00 8.000000e+00 -8.180000e+02 -8.270000e+02 8.884080e+03 8.125515e+03
75% 2.369094e+06 3.675300e+05 1.000000e+00 1.900000e+01 -3.610000e+02 -3.700000e+02 1.671021e+04 1.610842e+04
max 2.843499e+06 4.562550e+05 1.780000e+02 2.770000e+02 -1.000000e+00 -1.000000e+00 3.771488e+06 3.771488e+06
In [17]:
installPay.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB

Correlation between all vairables¶

In [18]:
#Correlation between all vairables
exclude_variables = ['SK_ID_PREV', 'SK_ID_CURR'] 

selected_variables = [col for col in installPay.columns if col not in exclude_variables] 
correlation_data = installPay[selected_variables] 
correlation_matrix = correlation_data.corr() 

plt.figure(figsize=(16, 14)) 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1) 
plt.title('Correlation Matrix for Installment Payments') 
plt.show()

Important Features in Installments_Payments Dataset¶

In [9]:
sns.set(style="darkgrid")
fig,axs=plt.subplots(2,2,figsize=(10,8))
sns.histplot(data=installPay,x="NUM_INSTALMENT_NUMBER",kde=True,ax=axs[0,0],color='green')
sns.histplot(data=installPay,x="DAYS_INSTALMENT",kde=True,ax=axs[0,1],color='red')
sns.histplot(data=installPay,x="DAYS_ENTRY_PAYMENT",kde=True,ax=axs[1,0],color='blue')
Out[9]:
<Axes: xlabel='DAYS_ENTRY_PAYMENT', ylabel='Count'>

EDA on POS_CASH_balance¶

Summary of the POS_CASH_balance Dataset and Missing Data¶

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# Assuming datasets is a DataFrame containing your data

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True))
    print("-----" * 15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----" * 15)
    print(f"Statistical summary of {df_name} is :")
    print("-----" * 15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(), 2).to_html())))

def stats_summary2(df, df_name):
    print(f"Description of the df continued for {df_name}:\n")
    print("-----" * 15)
    print("Data type value counts: \n", df.dtypes.value_counts())
    print("\nReturn the number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----" * 15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----" * 15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---" * 10)
    print("\n \n")

# Null data list and plot.
def null_data_plot(df, df_name):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending=False)
    missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data = missing_data[missing_data['Percent'] > 0]
    print("-----" * 15)
    print("-----" * 15)
    print('\n The Missing Data: \n')
    if len(missing_data) == 0:
        print("No missing Data")
    else:
        display(HTML(missing_data.to_html()))  # display all the rows
        print("-----" * 15)

        if len(df.columns) > 35:
            f, ax = plt.subplots(figsize=(8, 15))
        else:
            f, ax = plt.subplots()

        plt.title(f'Percent missing data for {df_name}.', fontsize=10)
        fig = sns.barplot(x="Percent", y=missing_data.index, data=missing_data, alpha=0.8)
        plt.xlabel('Percent of missing values', fontsize=10)
        plt.ylabel('Features', fontsize=10)
        return missing_data

# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--" * 40)
    print(" " * 20 + '\033[1m' + df_name + '\033[0m' + " " * 20)
    print("--" * 40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

# Example usage:
# Assuming 'datasets' is your DataFrame and 'application_train' is one of its components
display_stats(datasets["POS_CASH_balance"], "POS_CASH_balance")
display_feature_info(datasets["POS_CASH_balance"], "POS_CASH_balance")
--------------------------------------------------------------------------------
                    POS_CASH_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
---------------------------------------------------------------------------
Shape of the df POS_CASH_balance is (10001358, 8) 

---------------------------------------------------------------------------
Statistical summary of POS_CASH_balance is :
---------------------------------------------------------------------------
Description of the df POS_CASH_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
Description of the df continued for POS_CASH_balance:

---------------------------------------------------------------------------
Data type value counts: 
 int64      5
float64    2
object     1
Name: count, dtype: int64

Return the number of unique elements in the object. 

NAME_CONTRACT_STATUS    9
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of POS_CASH_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'SK_DPD', 'SK_DPD_DEF'], dtype='object')}
------------------------------
{'float64': Index(['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE'], dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
CNT_INSTALMENT_FUTURE 0.26 26087
CNT_INSTALMENT 0.26 26071
---------------------------------------------------------------------------
In [12]:
POS = datasets["POS_CASH_balance"]
In [13]:
POS.shape
Out[13]:
(10001358, 8)

Columns in POS_CASH_balance Dataset¶

In [14]:
POS.columns
Out[14]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'CNT_INSTALMENT',
       'CNT_INSTALMENT_FUTURE', 'NAME_CONTRACT_STATUS', 'SK_DPD',
       'SK_DPD_DEF'],
      dtype='object')
In [15]:
POS.describe()
Out[15]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
count 1.000136e+07 1.000136e+07 1.000136e+07 9.975287e+06 9.975271e+06 1.000136e+07 1.000136e+07
mean 1.903217e+06 2.784039e+05 -3.501259e+01 1.708965e+01 1.048384e+01 1.160693e+01 6.544684e-01
std 5.358465e+05 1.027637e+05 2.606657e+01 1.199506e+01 1.110906e+01 1.327140e+02 3.276249e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 1.000000e+01 3.000000e+00 0.000000e+00 0.000000e+00
50% 1.896565e+06 2.786540e+05 -2.800000e+01 1.200000e+01 7.000000e+00 0.000000e+00 0.000000e+00
75% 2.368963e+06 3.674290e+05 -1.300000e+01 2.400000e+01 1.400000e+01 0.000000e+00 0.000000e+00
max 2.843499e+06 4.562550e+05 -1.000000e+00 9.200000e+01 8.500000e+01 4.231000e+03 3.595000e+03
In [16]:
POS.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB

Histogram of Months Balance¶

In [17]:
plt.figure(figsize=(10, 6)) 
sns.histplot(data=POS, x='MONTHS_BALANCE', bins=10, kde=True, color='skyblue', edgecolor='black') 
plt.title('Histogram of Months Balance') 
plt.xlabel('Months Balance') 
plt.ylabel('Count') 
plt.show()

Testing Contract Status Variable¶

In [18]:
#Testing Contract Status Variable
plt.figure(figsize=(16,8))
sns.set_theme()
sns.countplot(x = 'NAME_CONTRACT_STATUS',data = POS)
plt.xlabel("Contract Status",fontweight='bold',size=13)
plt.ylabel("Number",fontweight='bold',size=13)
plt.show()

Correlation between all numerical vairables¶

In [21]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'POS' is your DataFrame
# If not already loaded, you can load it using: POS = pd.read_csv('your_dataset.csv')

# Select only numeric columns
numeric_columns = POS.select_dtypes(include='number').columns

# Exclude variables from the list
exclude_variables = ['SK_ID_PREV', 'SK_ID_CURR']
selected_variables = [col for col in numeric_columns if col not in exclude_variables]

# Create a DataFrame with selected variables
correlation_data = POS[selected_variables]

# Calculate the correlation matrix
correlation_matrix = correlation_data.corr()

# Plot the correlation matrix heatmap
plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=1)
plt.title('Correlation Matrix for Installment Payments')
plt.show()

Finding 5¶

Defaulters among the highly categorical features are seen in most, highlighting Organization Type, Family Type , Occupation Type & Education.

Finding 6¶

Noticeable correlations:

Amount of credit and amount of goods price show a strong correlation. Days of birth and days employed exhibit a strong correlation. There is a strong correlation between external source 1 and days of birth. These observations suggest potential opportunities for feature engineering.

Dataset questions¶

Unique record for each SK_ID_CURR¶

In [27]:
list(datasets.keys())
Out[27]:
['application_train',
 'application_test',
 'bureau',
 'bureau_balance',
 'credit_card_balance',
 'installments_payments',
 'previous_application',
 'POS_CASH_balance']
In [28]:
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
Out[28]:
True
In [29]:
# is there an overlap between the test and train customers 
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
Out[29]:
array([], dtype=int64)
In [30]:
# 
datasets["application_test"].shape
Out[30]:
(48744, 121)
In [31]:
datasets["application_train"].shape
Out[31]:
(307511, 122)

previous applications for the submission file¶

The persons in the kaggle submission file have had previous applications in the previous_application.csv. 47,800 out 48,744 people have had previous appications.

In [32]:
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

1,670,214 rows, 37 columns
In [33]:
print(f"There are  {appsDF.shape[0]:,} previous applications")
There are  1,670,214 previous applications
In [34]:
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_train"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 291,057
In [35]:
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 47,800
In [36]:
# How many previous applciations  per applicant in the previous_application 
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts >40])  #more that 40 previous applications
plt.hist(prevAppCounts[prevAppCounts>=0], bins=100)
plt.grid()
In [37]:
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=25)
plt.show()

Histogram of Number of previous applications for an ID¶

In [38]:
sum(appsDF['SK_ID_CURR'].value_counts()==1)
Out[38]:
60458
In [39]:
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
Out[39]:
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
Can we differentiate applications by low, medium and high previous apps?¶
* Low = <5 claims (22%)
* Medium = 10 to 39 claims (58%)
* High = 40 or more claims (20%)
In [40]:
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 10 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 10 or more previous apps: 41.76895
Percentage with 40 or more previous apps: 0.03453

Joining secondary tables with the primary table¶

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?

Joining previous_application with application_x¶

We refer to the application_train data (and also application_test data also) as the primary table and the other files as the secondary tables (e.g., previous_application dataset). All tables can be joined using the primary key SK_ID_PREV.

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:

  • A simple feature could be the number of previous applications.
  • Other summary features of original features such as AMT_APPLICATION, AMT_CREDIT could be based on average, min, max, median, etc.

To build such features, we need to join the application_train data (and also application_test data also) with the 'previous_application' dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application data sets, thereby generating many new (derived) features, and then joining (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]
  • Do the joins as part of the transformation steps. [Not recommended here. WHY?]. How can this be done? Will it work?
    • This would be necessary if we had dataset wide features such as IDF (inverse document frequency) which depend on the entire subset of data as opposed to a single loan application (e.g., a feature about the relative amount applied for such as the percentile of the loan amount being applied for).

I want you to think about this section and build on this.

Roadmap for secondary table processing¶

  1. Transform all the secondary tables to features that can be joined into the main table the application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
  • Merge the transformed secondary tables with the primary tables (i.e., the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset)), thereby leading to X_train, y_train, X_valid, etc.
  • Proceed with the learning pipeline using X_train, y_train, X_valid, etc.
  • Generate a submission file using the learnt model
In [41]:
appsDF.columns
Out[41]:
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
       'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
       'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')
In [42]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
Out[42]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
6 2315218 175704 Cash loans NaN 0.0 0.0 NaN NaN TUESDAY 11 ... XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN

1 rows × 37 columns

In [43]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]["AMT_CREDIT"]
Out[43]:
6    0.0
Name: AMT_CREDIT, dtype: float64
In [44]:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
Out[44]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
6 2315218 175704 Cash loans NaN 0.0 0.0 NaN NaN TUESDAY 11 ... XNA NaN XNA Cash NaN NaN NaN NaN NaN NaN

1 rows × 37 columns

Missing values in prevApps¶

In [114]:
appsDF.isna().sum()
Out[114]:
SK_ID_PREV                           0
SK_ID_CURR                           0
NAME_CONTRACT_TYPE                   0
AMT_ANNUITY                     372235
AMT_APPLICATION                      0
AMT_CREDIT                           1
AMT_DOWN_PAYMENT                895844
AMT_GOODS_PRICE                 385515
WEEKDAY_APPR_PROCESS_START           0
HOUR_APPR_PROCESS_START              0
FLAG_LAST_APPL_PER_CONTRACT          0
NFLAG_LAST_APPL_IN_DAY               0
RATE_DOWN_PAYMENT               895844
RATE_INTEREST_PRIMARY          1664263
RATE_INTEREST_PRIVILEGED       1664263
NAME_CASH_LOAN_PURPOSE               0
NAME_CONTRACT_STATUS                 0
DAYS_DECISION                        0
NAME_PAYMENT_TYPE                    0
CODE_REJECT_REASON                   0
NAME_TYPE_SUITE                 820405
NAME_CLIENT_TYPE                     0
NAME_GOODS_CATEGORY                  0
NAME_PORTFOLIO                       0
NAME_PRODUCT_TYPE                    0
CHANNEL_TYPE                         0
SELLERPLACE_AREA                     0
NAME_SELLER_INDUSTRY                 0
CNT_PAYMENT                     372230
NAME_YIELD_GROUP                     0
PRODUCT_COMBINATION                346
DAYS_FIRST_DRAWING              673065
DAYS_FIRST_DUE                  673065
DAYS_LAST_DUE_1ST_VERSION       673065
DAYS_LAST_DUE                   673065
DAYS_TERMINATION                673065
NFLAG_INSURED_ON_APPROVAL       673065
dtype: int64
In [ ]:
 

Importing Required Packages¶

In [243]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
from time import time, ctime

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
from scipy import stats
import json
from matplotlib import pyplot
from sklearn.model_selection import train_test_split

Features Aggregator¶

In [244]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

class FeaturesAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, file_name, features=None):  # no *args or **kargs
        self.features = features
        self.agg_op_features = {}
        for f in self.features:
            temp = {f"{file_name}_{f}_{func}":func for func in ['min', 'max', 'mean', 'count', 'sum']}
            self.agg_op_features[f]=[(k, v) for k, v in temp.items()]

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        group_cols = ["SK_ID_CURR"]
        result = X.groupby(group_cols).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=["SK_ID_CURR"])
        return result  # return dataframe with the join key "SK_ID_CURR"

Feature Engineering¶

In [245]:
class EngineerFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        self
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Flag to represent when Total income is greater than Credit
        X['INCOME_GT_CREDIT_FLAG'] = X['AMT_INCOME_TOTAL'] > X['AMT_CREDIT']
        # Column to represent Credit Income Percent
        X['CREDIT_INCOME_PERCENT'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
        # Column to represent Annuity Income percent
        X['ANNUITY_INCOME_PERCENT'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
        # Column to represent Credit Term
        X['CREDIT_TERM'] = X['AMT_CREDIT'] / X['AMT_ANNUITY'] 
        # Column to represent Days Employed percent in his life
        X['DAYS_EMPLOYED_PERCENT'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        return X
In [246]:
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion

prevApps_features = ['AMT_ANNUITY', 'AMT_APPLICATION']
bureau_features = ['AMT_ANNUITY', 'AMT_CREDIT_SUM']
bureau_bal_features = ['MONTHS_BALANCE']
cc_bal_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM']
installments_pmnts_features = ['AMT_INSTALMENT', 'AMT_PAYMENT']

appsTrainDF = datasets['application_train']
engineer_features = EngineerFeatures()
appsTrainDF = engineer_features.transform(appsTrainDF)

prevAppsDF = datasets["previous_application"]
features_aggregator = FeaturesAggregator('prevApps', features=prevApps_features)
prevApps_aggregated = features_aggregator.transform(prevAppsDF)

bureauDF = datasets["bureau"]
features_aggregator = FeaturesAggregator('bureau', features=bureau_features)
bureau_aggregated = features_aggregator.transform(bureauDF)

#bureaubalDF = datasets['bureau_balance']
#features_aggregator = FeaturesAggregator(features=bureau_bal_features)
#prevApps_aggregated = features_aggregator.transform(bureaubalDF)

ccbalDF = datasets["credit_card_balance"]
features_aggregator = FeaturesAggregator('credit_card_balance', features=cc_bal_features)
ccblance_aggregated = features_aggregator.transform(ccbalDF)

installmentspaymentsDF = datasets["installments_payments"]
features_aggregator = FeaturesAggregator('credit_card_balance', features=installments_pmnts_features)
installments_pmnts_aggregated = features_aggregator.transform(installmentspaymentsDF)

Merging all Data¶

In [247]:
merge_all_data = True

# merge primary table and secondary tables using features based on meta data and aggregage stats 
if merge_all_data:
  appsTrainDF = appsTrainDF.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
  appsTrainDF = appsTrainDF.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
  appsTrainDF = appsTrainDF.merge(ccblance_aggregated, how='left', on="SK_ID_CURR")
  appsTrainDF = appsTrainDF.merge(installments_pmnts_aggregated, how='left', on="SK_ID_CURR")
In [248]:
appsTrainDF.shape
Out[248]:
(307511, 172)

Plot Confusion Matrix¶

In [249]:
def plot_confusion_matrix(test_y, predicted_y):
    # Confusion matrix
    C = confusion_matrix(test_y, predicted_y)
    
    # Recall matrix
    A = (((C.T)/(C.sum(axis=1))).T)
    
    # Precision matrix
    B = (C/C.sum(axis=0))
    
    plt.figure(figsize=(20,4))
    
    labels = ['Re-paid(0)','Not Re-paid(1)']
    cmap=sns.light_palette("purple")
    plt.subplot(1,3,1)
    sns.heatmap(C, annot=True, cmap=cmap,fmt="d", xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Orignal Class')
    plt.title('Confusion matrix')
    
    plt.subplot(1,3,2)
    sns.heatmap(A, annot=True, cmap=cmap, xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Orignal Class')
    plt.title('Recall matrix')
    
    plt.subplot(1,3,3)
    sns.heatmap(B, annot=True, cmap=cmap, xticklabels = labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Orignal Class')
    plt.title('Precision matrix')
    
    plt.show()

Initiating Experiment Log¶

In [261]:
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train AUC", 
                                   "Valid AUC",
                                   "Test  AUC",
                                   "Train F1 Score",
                                   "Test F1 Score",                                   
                                   "Train Log Loss",
                                   "Test Log Loss",
                                   "P Score",
                                   "Train Time",
                                   "Test Time",
                                   "Description"
                                  ])
In [251]:
def pct(x):
    return round(100*x, 3)

List of all Numerical Variables¶

In [252]:
num_attribs = [
'AMT_INCOME_TOTAL',
'AMT_CREDIT',
'EXT_SOURCE_3',
'EXT_SOURCE_2',
'EXT_SOURCE_1',
'DAYS_EMPLOYED',
'DAYS_BIRTH',
'FLOORSMAX_AVG',
'FLOORSMAX_MEDI',
'FLOORSMAX_MODE',
'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE',
'ELEVATORS_AVG',
'REG_CITY_NOT_LIVE_CITY',
'FLAG_EMP_PHONE',
'REG_CITY_NOT_WORK_CITY',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
## Highly correlated previous applications
'prevApps_AMT_ANNUITY_mean',
## Highly correlated Credit card balance features
'credit_card_balance_MONTHS_BALANCE_count',
'credit_card_balance_AMT_BALANCE_count',
'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_count',
'credit_card_balance_CNT_INSTALMENT_MATURE_CUM_sum',
'credit_card_balance_MONTHS_BALANCE_sum',
'credit_card_balance_MONTHS_BALANCE_min',
'credit_card_balance_MONTHS_BALANCE_mean',
'credit_card_balance_AMT_BALANCE_min',
'credit_card_balance_AMT_BALANCE_max',
'credit_card_balance_AMT_BALANCE_mean'
]

List of all Categorical Variables¶

In [253]:
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE','NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

List of all Selected Variables¶

In [254]:
selected_features = num_attribs + cat_attribs
tot_features = f"{len(selected_features)}:   Num:{len(num_attribs)},    Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
Out[254]:
'38:   Num:31,    Cat:7'

Pipeline Coding¶

In [255]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
    
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])

data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])   

Splitting to a smaller Dataset¶

In [256]:
# Split Sample to feed the pipeline and it will result in a new dataset that is (1 / splits) the size 
splits = 3

# Train Test split percentage
subsample_rate = 0.3

train_dataset = appsTrainDF
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']

## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train, test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train, test_size=0.15, random_state=42)

print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
X train           shape: (60989, 38)
X validation      shape: (10763, 38)
X test            shape: (30752, 38)

Shufflesplit¶

In [257]:
cvSplits = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
In [258]:
X_train.head(5)
Out[258]:
AMT_INCOME_TOTAL AMT_CREDIT EXT_SOURCE_3 EXT_SOURCE_2 EXT_SOURCE_1 DAYS_EMPLOYED DAYS_BIRTH FLOORSMAX_AVG FLOORSMAX_MEDI FLOORSMAX_MODE ... credit_card_balance_AMT_BALANCE_min credit_card_balance_AMT_BALANCE_max credit_card_balance_AMT_BALANCE_mean CODE_GENDER FLAG_OWN_REALTY FLAG_OWN_CAR NAME_CONTRACT_TYPE NAME_EDUCATION_TYPE OCCUPATION_TYPE NAME_INCOME_TYPE
40832 117000.0 157500.0 0.729567 0.262060 0.505998 -439.0 -16633.0 0.1667 0.1667 0.1667 ... NaN NaN NaN F 1 0 0 Secondary / secondary special Sales staff Working
36820 166500.0 900000.0 0.743559 0.451283 0.600909 365243.0 -22564.0 0.1667 0.1667 0.1667 ... 0.0 194627.34 40994.615602 F 1 1 0 Secondary / secondary special Laborers Pensioner
81804 90000.0 495000.0 0.535276 0.480293 0.505998 -434.0 -15989.0 0.1667 0.1667 0.1667 ... 0.0 0.00 0.000000 M 0 0 0 Secondary / secondary special Laborers Working
35092 112500.0 508495.5 0.722393 0.260275 0.505998 365243.0 -22918.0 0.1667 0.1667 0.1667 ... NaN NaN NaN F 1 1 0 Lower secondary Laborers Pensioner
57197 135000.0 400500.0 0.304672 0.526361 0.275414 -641.0 -12513.0 0.1667 0.1667 0.1667 ... 0.0 64450.80 3229.703053 M 1 0 0 Secondary / secondary special Core staff Working

5 rows × 38 columns

Logistic Regression with Dataprep pipeline including selected features¶

Experiment 1 with Imbalanced Data¶

In [259]:
pipeline = Pipeline([
    ("prep", data_prep_pipeline),
    ("clf", LogisticRegression(solver='saga',random_state=42))
])

start = time()
model = pipeline.fit(X_train, y_train)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(pipeline, X_train , y_train, cv=cvSplits)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)

Experiment 1 Logging the baseline pipeline with imbalanced Data¶

In [262]:
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid, model.predict(X_valid))),
                pct(accuracy_score(y_test, model.predict(X_test))),
                roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
                roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                f1_score(y_train, model.predict(X_train)),
                f1_score(y_test, model.predict(X_test)),
                log_loss(y_train, model.predict(X_train)),
                log_loss(y_test, model.predict(X_test)),0 ],4)) \
                + [train_time,test_time] + [f"Imbalanced Logistic reg with 20% training data"]
expLog
Out[262]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.88 91.815 0.7412 0.7362 0.737 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data

AUC Curve for Experiment 1¶

In [263]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test,model.predict_proba(X_test)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[263]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for experiment 1¶

In [264]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true = y_test
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred = (y_pred_proba > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 2 with balanced Dataset¶

Down-sample Majority Class¶

Since the dataset is imbalanced in the favor of majority samples being where the loans are repaid (TARGET=1) in a ratio more than 10:1, we can resample the data, by undersampling the majority class to make the data more balanced. So that we do not lose too much valuable data, the number of majority class samples is kept as twice the number of minority class.

In [265]:
# Down-sample Majority Class

train = pd.concat([X_train, y_train], axis=1)
count = train['TARGET'].value_counts()
num_majority = count[0]
num_minority = count[1]

#Number of undersampled majority class 2 x minority class
num_undersample_majority = 2 * num_minority

#separating majority and minority classes
df_majority = train[train["TARGET"] == 0]
df_minority = train[train["TARGET"] == 1]

df_majority_undersampled = resample(df_majority, replace=False, n_samples=num_undersample_majority, random_state=42)

df_undersampled = pd.concat([df_minority, df_majority_undersampled], axis=0)

#splitting dependent and independent variables
X_train = df_undersampled[selected_features]
y_train = df_undersampled['TARGET']

df_undersampled.TARGET.value_counts()
Out[265]:
TARGET
0.0    9902
1.0    4951
Name: count, dtype: int64

Splitting Data¶

In [266]:
cvSplits = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

Logistic Regression with Dataprep pipeline including selected features Exp 2¶

In [267]:
pipeline = Pipeline([
    ("prep", data_prep_pipeline),
    ("clf", LogisticRegression(solver='saga',random_state=42))
])

start = time()
model = pipeline.fit(X_train, y_train)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(pipeline, X_train , y_train, cv=cvSplits)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)

Experiment 2 Logging the baseline pipeline with balanced Data¶

In [268]:
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid, model.predict(X_valid))),
                pct(accuracy_score(y_test, model.predict(X_test))),
                roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
                roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                f1_score(y_train, model.predict(X_train)),
                f1_score(y_test, model.predict(X_test)),
                log_loss(y_train, model.predict(X_train)),
                log_loss(y_test, model.predict(X_test)),0 ],4)) \
                + [train_time,test_time] + [f"Balanced Logistic reg with 30% training data"]
expLog
Out[268]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data

AUC Curve for Experiment 2¶

In [269]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test,model.predict_proba(X_test)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[269]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for experiment 2¶

In [270]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true = y_test
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred = (y_pred_proba > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Advanced Feature Engineering¶

Benefits of Feature Engineering
An effective Feature Engineering implies:

  1. Higher efficiency of the model
  2. Easier Algorithms that fit the data
  3. Easier for Algorithms to detect patterns in the data
  4. Greater Flexibility of the features

image-2.png

Our feature engineering endeavors encompassed several key aspects, delineated as follows:

  1. Incorporating Domain-Specific Insights: The integration of custom domain knowledge played a pivotal role in the formulation of unique features tailored to our dataset.

  2. Crafting Engineered Aggregated Features: A deliberate effort was made to create novel aggregated features through meticulous engineering, enhancing the dataset's overall representational capacity.

  3. Exploratory Modeling of the Data: We delved into experimental modeling techniques, aiming to uncover hidden patterns and relationships within the dataset that might have eluded conventional analysis.

  4. Validation of Manual One-Hot Encoding (OHE): Rigorous validation processes were applied to ensure the accuracy and effectiveness of manually applied One-Hot Encoding, a critical step in categorical data representation.

  5. Polynomial Feature Expansion (Degree 4): A sophisticated approach involved the generation of polynomial features up to the fourth degree for select variables, amplifying the complexity and richness of the feature set.

  6. Comprehensive Dataset Merging: All relevant datasets were systematically merged, fostering a holistic view of the data and promoting comprehensive analyses.

  7. Pruning Columns with Missing Values: To enhance the dataset's integrity, columns with missing values were judiciously identified and subsequently removed, streamlining the dataset for further analysis.

image.png

A pivotal step in the feature engineering process involves the integration of domain knowledge-based features, a critical factor in enhancing model accuracy. Initially, we undertook the task of identifying these features for each dataset. Among the novel custom features introduced were metrics such as post-payment credit card balance relative to the due amount, average application amount, credit average, available credit as a percentage of income, annuity as a percentage of income, and annuity as a percentage of available credit.

Subsequently, we delved into numerical feature identification and aggregation, employing mean, minimum, and maximum values. Although an attempt was made to implement label encoding for unique values exceeding 5 during the engineering phase, a strategic decision led to the application of One-Hot Encoding (OHE) at the pipeline level. This targeted specific highly correlated fields in the final merged dataset, optimizing code management.

Extensive feature engineering was executed through multiple modeling approaches, involving primary, secondary, and tertiary tables, culminating in an optimized approach with minimal memory usage. The first attempt focused on creating engineered and aggregated features for Key-Level 3 tables, merging them with Key-Level 2 tables, and ultimately combining them with the primary dataset. However, this approach resulted in a surplus of redundant features, consuming significant memory.

In Attempt 2, a streamlined approach was adopted, creating custom and aggregated features for Key-Level 3 tables, merging them with Key-Level 2 tables based on the primary key, and extending this to Key-Level 1 tables using additional aggregated columns. This approach reduced duplicates, optimized memory usage, and employed a garbage collector after each merge.

In Attempt 3, the merged dataframe from the previous attempt was further enriched with polynomial features of degree 4. A final merge of Key-Level 3, Key-Level 2, and Key-Level 1 datasets formed the training dataframe, with meticulous attention to ensuring that no columns had more than 50% missing data.

The process of engineering and incorporating these features into the model, coupled with judicious splits during testing, initially yielded lower accuracy. However, deploying these merged features with well-considered splits during the training phase resulted in improved accuracy and diminished risk of overfitting, especially notable in models like Random Forest and XGBoost.

Future endeavors include implementing label encoding for all unique categorical values, exploring techniques such as PCA or custom functions to address multicollinearity in the pipeline, eliminating low-importance features, and evaluating their impact on model accuracy.

image.png

Loading data¶

In [397]:
data_app_train = freshdata['application_train']
data_app_test = freshdata['application_test']

Function to calculate missing values by column¶

In [398]:
# Function to calculate missing values by column
def missing_values(df):
        # Total missing values
        mis_val = df.isnull().sum()
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("The dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns
In [399]:
missing_values(data_app_train)
The dataframe has 122 columns.
There are 67 columns that have missing values.
Out[399]:
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.9
COMMONAREA_AVG 214865 69.9
COMMONAREA_MODE 214865 69.9
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_MODE 213514 69.4
... ... ...
EXT_SOURCE_2 660 0.2
AMT_GOODS_PRICE 278 0.1
AMT_ANNUITY 12 0.0
CNT_FAM_MEMBERS 2 0.0
DAYS_LAST_PHONE_CHANGE 1 0.0

67 rows × 2 columns

Checking the outliers¶

In [400]:
data_app_train_num = data_app_train.select_dtypes(include=[np.number]).drop('SK_ID_CURR', axis = 1)
LowerOut = data_app_train_num.quantile(0.025)
HigherOut = data_app_train_num.quantile(0.975)
Outliers = (data_app_train_num < LowerOut) | (data_app_train_num > HigherOut)
print(Outliers.sum().sort_values())
TARGET                        0
FLAG_DOCUMENT_3               0
FLAG_DOCUMENT_6               0
FLAG_DOCUMENT_8               0
REG_CITY_NOT_WORK_CITY        0
                          ...  
DAYS_ID_PUBLISH           15338
EXT_SOURCE_2              15344
DAYS_REGISTRATION         15360
AMT_ANNUITY               15364
DAYS_BIRTH                15366
Length: 105, dtype: int64

Treating the missing value by median imputation for numerical values nd most frequent value imputation for categorical values.¶

In [401]:
from sklearn.impute import SimpleImputer

# select numerical columns 
numerical_col_train = data_app_train.select_dtypes(include=[np.number]).columns
numerical_col_test = data_app_test.select_dtypes(include=[np.number]).columns
# Selecting the categorical variables
categorical_col_train = data_app_train.select_dtypes(exclude=[np.number]).columns
categorical_col_test = data_app_test.select_dtypes(exclude=[np.number]).columns

# Numercial missing value imputation
imputer = SimpleImputer(missing_values=np.NaN, strategy='median')
data_app_train[numerical_col_train] = imputer.fit_transform(data_app_train[numerical_col_train])
data_app_test[numerical_col_test] = imputer.fit_transform(data_app_test[numerical_col_test])

# Categorical missing value imputation
imputer = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
data_app_train[categorical_col_train] = imputer.fit_transform(data_app_train[categorical_col_train])
data_app_test[categorical_col_test] = imputer.fit_transform(data_app_test[categorical_col_test])

missing_values(data_app_train)
The dataframe has 122 columns.
There are 0 columns that have missing values.
Out[401]:
Missing Values % of Total Values

Checking the type of loans and sub categories¶

In [402]:
colors_target = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
fig, ax = plt.subplots(1, 3,figsize=(20, 7))
fig.suptitle("Type and purpose of loan", fontsize=12)
colors_target = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
perc = [str(round(e / s * 100., 1)) + '%' for s in (sum(data_app_train['NAME_CONTRACT_TYPE'].value_counts()),) for e in data_app_train['NAME_CONTRACT_TYPE'].value_counts()]
wedges, texts = ax[0].pie(data_app_train['NAME_CONTRACT_TYPE'].value_counts(), wedgeprops=dict(width=0.5), startangle=90)
ax[0].pie(data_app_train.groupby('NAME_CONTRACT_TYPE')['TARGET'].value_counts(),colors=colors_target,labels=[*['paid', 'not paid']*len(data_app_train['NAME_CONTRACT_TYPE'].value_counts())],radius=0.7,startangle=90, autopct='%1.1f%%', pctdistance=0.8, labeldistance=1.1, wedgeprops=dict(width=0.3))
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"),zorder=0, va="center")

for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1)/2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax[0].annotate(data_app_train['NAME_CONTRACT_TYPE'].unique()[i] + ' ' + perc[i], xy=(x, y), xytext=(0.5*np.sign(x), 1.4*y),
                horizontalalignment=horizontalalignment, **kw)
ax[0].set_title("Types of loan\nwith Target", fontsize=12, y=0.45)



data_app_train[data_app_train['FLAG_OWN_REALTY']=='Y'].groupby('TARGET').size().plot(kind='bar', color='#ff6666', ax=ax[1])
ax[1].set_title("Loan for owing realty")
ax[1].set_xticks([0, 1], ['Paid', 'Unpaid'])
ax[1].set_ylim(0, 200000)

data_app_train[data_app_train['FLAG_OWN_CAR']=='Y'].groupby('TARGET').size().plot(kind='bar', color='#ffcc99', ax=ax[2])
ax[2].set_title("Loan for owing car")
ax[2].set_xticks([0, 1], ['Paid', 'Unpaid'])
ax[2].set_ylim(0, 200000)
Out[402]:
(0.0, 200000.0)

Observation 7¶

The chart indicates that 82.9% of individuals with outstanding real estate loans have taken on cash loans, while 9.5% have taken on revolving loans. This suggests that cash loans are the preferred option for real estate purchases.

For car ownership loans, 90.5% of individuals have opted for cash loans, while only 0.5% have chosen revolving loans. This further highlights the preference for cash loans among individuals financing car purchases.

The chart also distinguishes between paid and unpaid loans. For real estate loans, 17.1% of cash loans remain unpaid, while for car ownership loans, 7.6% of cash loans remain unpaid. This indicates that a higher proportion of cash car loans are not yet paid off compared to cash real estate loans.

Verifying the different applicants with TARGET¶

In [403]:
# function to display horizontal bar chart
def barHorizontal(columns, ylables, title, tight=False):
    if tight:
        plt.figure(figsize=(20,15), tight_layout=True)
    else:
        plt.figure(figsize=(20,10), tight_layout=True)
    for index, col in enumerate(columns):
        plt.subplot(2, 3, index+1)
        barH = sns.countplot(y = col, data = data_app_train,  hue='TARGET', palette='Set2')
        barH.set_ylabel(ylables[index])
        barH.set_title(title[index])
        barH.legend(title="Target")
        barH.legend(title="Target", loc="lower right")
        sns.despine(bottom = True, left = True)
        for p in barH.patches:
            if tight:
                barH.annotate("%.0f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
                        xytext=(5, 0), textcoords='offset points', ha="left", va="center",fontsize=5)
            else:
                barH.annotate("%.0f" % p.get_width(), xy=(p.get_width(), p.get_y()+p.get_height()/2),
                        xytext=(5, 0), textcoords='offset points', ha="left", va="center")
In [404]:
barHorizontal(['NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'NAME_EDUCATION_TYPE','NAME_TYPE_SUITE', 'NAME_HOUSING_TYPE'], ['Source', 'Status', 'Education Type', 'Type of suite', 'Type of house'], ["Income sources of Applicant", "Family status of the applicant", "Education of the applicant", "Who accompanined client when applying for the loan", "What type house was purchased by the applicant"], tight=False)
In [405]:
barHorizontal(['OCCUPATION_TYPE', 'ORGANIZATION_TYPE'], ['Type', 'Type of oragnization'], ["Occupation type of the applicant", "Types of Organizations"], tight = True)

Encoding the categorical values¶

In [406]:
# Number of unique classes in each object column
data_app_train.select_dtypes('object').nunique()
Out[406]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
In [407]:
# selecting the variables with 2 distinct categories
two_cat_col = data_app_train.select_dtypes('object').loc[:, list(data_app_train.select_dtypes('object').nunique()==2)]
two_cat_col.columns
Out[407]:
Index(['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'EMERGENCYSTATE_MODE'],
      dtype='object')
In [408]:
# Label encoding 2 distict values
label = LabelEncoder()
for col in two_cat_col.columns:
    label.fit(data_app_train[col])
    # Transform both training and testing data
    data_app_train[col] = label.transform(data_app_train[col])
    data_app_test[col] = label.transform(data_app_test[col])
data_app_train[two_cat_col.columns]
Out[408]:
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE
0 0 0 1 0
1 0 0 0 0
2 1 1 1 0
3 0 0 1 0
4 0 0 1 0
... ... ... ... ...
307506 0 0 0 0
307507 0 0 1 0
307508 0 0 1 0
307509 0 0 1 0
307510 0 0 0 0

307511 rows × 4 columns

one-hot encoding of categorical variables more than 2 distinct values¶

In [409]:
# one-hot encoding of categorical variables more than 2 distinct values 
data_app_train = pd.get_dummies(data_app_train)
data_app_test = pd.get_dummies(data_app_test)

print('Training Features shape: ', data_app_train.shape)
print('Testing Features shape: ', data_app_test.shape)
Training Features shape:  (307511, 242)
Testing Features shape:  (48744, 238)
In [410]:
train_labels = data_app_train['TARGET']

# Align the training and testing data, keeping the columns present in both dataframes
data_app_train, data_app_test = data_app_train.align(data_app_test, join = 'inner', axis = 1)

# Add the target back in
data_app_train['TARGET'] = train_labels

print('Training Features shape: ', data_app_train.shape)
print('Testing Features shape: ', data_app_test.shape)
Training Features shape:  (307511, 239)
Testing Features shape:  (48744, 238)

Polynomial Features¶

Features created by raising existing features to an exponent. For example, if a dataset had one input feature X, then a polynomial feature would be the addition of a new feature (column) where values were calculated by squaring the values in X, e.g. X^2. This process can be repeated for each input variable in the dataset, creating a transformed version of each. we can create variables EXT_SOURCE_1^2 and EXT_SOURCE_2^2 and also variables such as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on.

In [411]:
from sklearn.preprocessing import PolynomialFeatures
# Make a new dataframe for polynomial features
feature_col = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']
poly_features = data_app_train[feature_col]
poly_features_test = data_app_test[feature_col]

                                  
# Create the polynomial object with specified degree 3
poly_transformer = PolynomialFeatures(degree = 3)

Train the polynomial features¶

In [412]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)
Polynomial Features shape:  (307511, 35)

There are 35 features with individual features raised to powers up to degree 3 and interaction terms. Now, we can see whether any of these new features are correlated with the target.

Correlation¶

The correlational value indidcations

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”

And negative sign indicates, negative relation, opposite direction.

In [415]:
# creating the dataframe from the created variables
poly_features = pd.DataFrame(poly_features, columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']))
poly_features['TARGET'] = data_app_train['TARGET']

plt.figure(figsize=(20, 10))
# Seting the range of values to be displayed on the colormap from -1 to 1, and set the annotation to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(poly_features.corr(), vmin=-1, vmax=1, annot=True, cmap="BrBG")
heatmap.set_title('Correlation Heatmap with R values for polynomial features', fontdict={'fontsize':12}, pad=12)
Out[415]:
Text(0.5, 1.0, 'Correlation Heatmap with R values for polynomial features')

Observation 8¶

Some of the derived features have more correlation with the target than the original features

Other dataset (verifying correaltion to create new Features)¶

Client's previous loans at other financial institutions

In [416]:
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = data_app_train['SK_ID_CURR']
data_app_train_poly = data_app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polnomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = data_app_test['SK_ID_CURR']
data_app_test_poly = data_app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
data_app_train_poly, data_app_test_poly = data_app_train_poly.align(data_app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', data_app_train_poly.shape)
print('Testing data with polynomial features shape:  ', data_app_test_poly.shape)
Training data with polynomial features shape:  (307511, 273)
Testing data with polynomial features shape:   (48744, 273)
In [420]:
data_bureau = freshdata['bureau']
data_bureau.head()
Out[420]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
In [421]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = data_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
Out[421]:
SK_ID_CURR previous_loan_counts
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
In [422]:
train_data_copy = data_app_train.copy()
# Join to the training dataframe
train_data_copy = train_data_copy.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

# Filling the missing value with the mean
train_data_copy['previous_loan_counts'] = train_data_copy['previous_loan_counts'].fillna(train_data_copy['previous_loan_counts'].mean())

# Checking the correlation with the target variables
corr = train_data_copy['TARGET'].corr(train_data_copy['previous_loan_counts'])
corr
Out[422]:
0.003680828614269069

Observation 9¶

The correlation of new variable with target variable is very low.

In [423]:
# Filtering the connection variable between datasets (SK_ID_CURR) and target variable
train_data_copy = data_app_train.loc[:,['SK_ID_CURR', 'TARGET']]
#Function to check the correlation of the varables from other dataset with target variable
def otherDatasetVerfication(df):
    # grouping the data on the basis of the current client ID
    # groupedData = df.groupby('SK_ID_CURR', as_index=False).mean()
    categorical = pd.get_dummies(df)
    # Creating new merged dataset 
    data_new = pd.merge(train_data_copy, categorical, on='SK_ID_CURR', how="left")
    # Calculating relation with numerical data
    correlations_data = data_new.select_dtypes(include=[np.number]).corr()['TARGET'].sort_values()
    print('Most Positive Correlations:\n', correlations_data.tail(5))
    print('\nMost Negative Correlations:\n', correlations_data.head(5))

data_bureau Corelations¶

In [424]:
otherDatasetVerfication(data_bureau)
Most Positive Correlations:
 DAYS_CREDIT_ENDDATE    0.026497
DAYS_ENDDATE_FACT      0.039057
DAYS_CREDIT_UPDATE     0.041076
DAYS_CREDIT            0.061556
TARGET                 1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 AMT_CREDIT_SUM         -0.010606
SK_ID_BUREAU           -0.009018
AMT_CREDIT_SUM_LIMIT   -0.005990
SK_ID_CURR             -0.002900
AMT_ANNUITY             0.000117
Name: TARGET, dtype: float64

credit_card_balance corelations¶

In [425]:
data_credit_card_balance = freshdata['credit_card_balance']
otherDatasetVerfication(data_credit_card_balance)
Most Positive Correlations:
 AMT_RECEIVABLE_PRINCIPAL    0.049692
AMT_RECIVABLE               0.049803
AMT_TOTAL_RECEIVABLE        0.049839
AMT_BALANCE                 0.050098
TARGET                      1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 CNT_INSTALMENT_MATURE_CUM    -0.023684
SK_ID_CURR                   -0.004412
SK_DPD                        0.001684
SK_ID_PREV                    0.002571
CNT_DRAWINGS_OTHER_CURRENT    0.003044
Name: TARGET, dtype: float64

POS_CASH_balance corelations¶

In [426]:
data_POS_CASH_balance = freshdata['POS_CASH_balance']
otherDatasetVerfication(data_POS_CASH_balance)
Most Positive Correlations:
 SK_DPD                   0.009866
CNT_INSTALMENT           0.018506
MONTHS_BALANCE           0.020147
CNT_INSTALMENT_FUTURE    0.021972
TARGET                   1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 SK_ID_CURR       -0.002244
SK_ID_PREV       -0.000056
SK_DPD_DEF        0.008594
SK_DPD            0.009866
CNT_INSTALMENT    0.018506
Name: TARGET, dtype: float64

previous_application corelations¶

In [427]:
data_previous_application = freshdata['previous_application']
otherDatasetVerfication(data_previous_application)
Most Positive Correlations:
 DAYS_LAST_DUE_1ST_VERSION    0.018021
RATE_INTEREST_PRIVILEGED     0.028640
CNT_PAYMENT                  0.030480
DAYS_DECISION                0.039901
TARGET                       1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 DAYS_FIRST_DRAWING        -0.031154
HOUR_APPR_PROCESS_START   -0.027809
RATE_DOWN_PAYMENT         -0.026111
AMT_DOWN_PAYMENT          -0.016918
AMT_ANNUITY               -0.014922
Name: TARGET, dtype: float64

Normalization¶

In [428]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

cols_train = data_app_train.columns
cols_test = data_app_test.columns
# transform data
data_app_train = pd.DataFrame(scaler.fit_transform(data_app_train), columns=cols_train)
data_app_test = pd.DataFrame(scaler.fit_transform(data_app_test), columns=cols_test)

Certainly! Here's the information formatted in Markdown:

Domain Features¶

We have introduced four new features based on financial knowledge:

  1. CREDIT_INCOME_PERCENT:

    • Definition: Percentage of the credit amount relative to a client's income.
    • Calculation: Credit Amount/Client’s Income×100
    • Interpretation: Indicates the proportion of the client's income dedicated to repaying the credit.
  2. ANNUITY_INCOME_PERCENT:

    • Definition: Percentage of the loan annuity relative to a client's income.
    • Calculation: Loan Annuity/Client's Income x100
    • Interpretation: Reflects the share of the client's income allocated to loan payments.
  3. CREDIT_TERM:

    • Definition: Length of the payment period in months (considering the annuity is the monthly amount due).
    • Calculation: Number of months required to repay the loan.
    • Interpretation: Provides information on the duration of the loan.
  4. DAYS_EMPLOYED_PERCENT:

    • Definition: Percentage of days employed relative to the client's age.
    • Calculation: Days Employed/Client's Age x100
    • Interpretation: Indicates the portion of the client's life spent employed.

These features offer a more nuanced understanding of a client's financial profile, considering income, loan terms, and employment history. When incorporated into predictive models, they contribute to a more comprehensive assessment of creditworthiness.

In [429]:
app_train_domain = data_app_train.copy()
app_test_domain = data_app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']

app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']

app_train_domain.replace([-np.inf, np.inf], np.nan, inplace=True)
In [430]:
plt.figure(figsize=(20, 5))
# Seting the range of values to be displayed on the colormap from -1 to 1, and set the annotation to True to display the correlation values on the heatmap.
heatmap = sns.heatmap(app_train_domain.loc[:,['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT', 'TARGET']].corr(), vmin=-1, vmax=1, annot=True, cmap=sns.diverging_palette(230, 20, as_cmap=True), center=0,square=True)
heatmap.set_title('Correlation Heatmap with R values', fontdict={'fontsize':12}, pad=12)
Out[430]:
Text(0.5, 1.0, 'Correlation Heatmap with R values')

Observation 10¶

  1. The heatmap shows the correlation between the following variables: credit income percent, annuity income percent, credit term, days employed percent, and target.

  2. The correlation coefficients range from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear correlation.

  3. The strongest positive correlation is between credit income percent and days employed percent (0.75), which means that people with higher credit income percent are more likely to be employed for a longer number of days.

  4. The strongest negative correlation is between target and credit term (-0.75), which means that people with longer credit terms are less likely to achieve the target.

Other notable correlations include:

Credit income percent and annuity income percent (0.87)
Credit income percent and target (0.75)
Annuity income percent and target (0.008)
Days employed percent and target (-0.028)

Interpretation:

  1. The heatmap suggests that credit income percent and days employed percent are the most important variables for predicting the target. People with higher credit income percent and days employed percent are more likely to achieve the target.

  2. The negative correlation between credit term and target suggests that longer credit terms may be detrimental to achieving the target. This could be because longer credit terms typically result in higher interest payments, which can make it more difficult to save money and achieve financial goals.

Overall, the heatmap provides insights into the relationships between the different variables and can be used to identify the most important factors for predicting the target.

Additional notes:

It is important to note that correlation does not equal causation. Just because two variables are correlated does not mean that one causes the other. The heatmap is based on a specific dataset, and the correlations may not be generalizable to other populations. It is also important to consider the magnitude of the correlation coefficients. Correlations that are close to zero may not be statistically significant.

In [431]:
# spliting the dataset
from sklearn.model_selection import StratifiedKFold, train_test_split 

data_model_train = data_app_train.drop('TARGET', axis=1)
target = data_app_train['TARGET']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(data_model_train,target, test_size=0.3, random_state=0)
X_train_simple, X_valid_simple, y_train_simple, y_valid_simple = train_test_split(X_train_simple, y_train_simple, test_size=0.15, random_state=42)

Experiment 3 - Advanced feature training with Logistic Regression on Imbalanced Dataset¶

In [271]:
newpipeline = Pipeline([
    ("clf", LogisticRegression(solver='saga',random_state=42))
])

start = time()
model = newpipeline.fit(X_train_simple, y_train_simple)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_simple , y_train_simple, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_simple, y_test_simple)
test_time = np.round(time() - start, 4)
In [275]:
exp_name = f"Baseline_advanced_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_simple, model.predict(X_valid_simple))),
                pct(accuracy_score(y_test_simple, model.predict(X_test_simple))),
                roc_auc_score(y_train_simple, model.predict_proba(X_train_simple)[:, 1]),
                roc_auc_score(y_valid_simple, model.predict_proba(X_valid_simple)[:, 1]),
                roc_auc_score(y_test_simple, model.predict_proba(X_test_simple)[:, 1]),
                f1_score(y_train_simple, model.predict(X_train_simple)),
                f1_score(y_test_simple, model.predict(X_test_simple)),
                log_loss(y_train_simple, model.predict(X_train_simple)),
                log_loss(y_test_simple, model.predict(X_test_simple)),0 ],4)) \
                + [train_time,test_time] + [f"experiment  3 -> Imbalanced Logistic reg with advanced features"]
expLog
Out[275]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...

ROC Curve for Experiment 3¶

In [280]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_simple,model.predict_proba(X_test_simple)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[280]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Experiment 3¶

In [281]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_simple = y_test_simple
y_pred_proba_simple = model.predict_proba(X_test_simple)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_simple = (y_pred_proba_simple > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_simple, y_pred_simple)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 4 - Advanced feature training with Decision Tree on Imbalanced Dataset¶

In [282]:
from sklearn.tree import DecisionTreeClassifier
newpipeline = Pipeline([
    ("clf", DecisionTreeClassifier())
])

start = time()
model = newpipeline.fit(X_train_simple, y_train_simple)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_simple , y_train_simple, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_simple, y_test_simple)
test_time = np.round(time() - start, 4)
In [283]:
exp_name = f"Baseline_advanced_features_with_DT"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_simple, model.predict(X_valid_simple))),
                pct(accuracy_score(y_test_simple, model.predict(X_test_simple))),
                roc_auc_score(y_train_simple, model.predict_proba(X_train_simple)[:, 1]),
                roc_auc_score(y_valid_simple, model.predict_proba(X_valid_simple)[:, 1]),
                roc_auc_score(y_test_simple, model.predict_proba(X_test_simple)[:, 1]),
                f1_score(y_train_simple, model.predict(X_train_simple)),
                f1_score(y_test_simple, model.predict(X_test_simple)),
                log_loss(y_train_simple, model.predict(X_train_simple)),
                log_loss(y_test_simple, model.predict(X_test_simple)),0 ],4)) \
                + [train_time,test_time] + [f"experiment  3 -> Imbalanced DecisionTree with advanced features"]
expLog
Out[283]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...

ROC Curve for Experiment 4¶

In [284]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_simple,model.predict_proba(X_test_simple)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[284]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for Experiment 4¶

In [285]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_simple = y_test_simple
y_pred_proba_simple = model.predict_proba(X_test_simple)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_simple = (y_pred_proba_simple > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_simple, y_pred_simple)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 5 - Advanced feature training with RandomForest on Imbalanced Dataset¶

In [286]:
from sklearn.ensemble import RandomForestClassifier
newpipeline = Pipeline([
    ("clf", RandomForestClassifier(n_estimators = 100))
])

start = time()
model = newpipeline.fit(X_train_simple, y_train_simple)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_simple , y_train_simple, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_simple, y_test_simple)
test_time = np.round(time() - start, 4)
In [290]:
exp_name = f"Baseline_advanced_features_with_randmforest"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_simple, model.predict(X_valid_simple))),
                pct(accuracy_score(y_test_simple, model.predict(X_test_simple))),
                roc_auc_score(y_train_simple, model.predict_proba(X_train_simple)[:, 1]),
                roc_auc_score(y_valid_simple, model.predict_proba(X_valid_simple)[:, 1]),
                roc_auc_score(y_test_simple, model.predict_proba(X_test_simple)[:, 1]),
                f1_score(y_train_simple, model.predict(X_train_simple)),
                f1_score(y_test_simple, model.predict(X_test_simple)),
                log_loss(y_train_simple, model.predict(X_train_simple)),
                log_loss(y_test_simple, model.predict(X_test_simple)),0 ],4)) \
                + [train_time,test_time] + [f"Imbalanced randomforest with advanced features"]
expLog
Out[290]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features

ROC Curve for Experiment 5¶

In [291]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_simple,model.predict_proba(X_test_simple)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[291]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for experiment 5¶

In [292]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_simple = y_test_simple
y_pred_proba_simple = model.predict_proba(X_test_simple)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_simple = (y_pred_proba_simple > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_simple, y_pred_simple)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 6 - Advanced feature training with Bagging on Imbalanced Dataset¶

In [293]:
from sklearn.ensemble import BaggingClassifier

newpipeline = Pipeline([
    ("clf", BaggingClassifier(n_estimators=50, random_state=0))
])

start = time()
model = newpipeline.fit(X_train_simple, y_train_simple)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_simple , y_train_simple, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_simple, y_test_simple)
test_time = np.round(time() - start, 4)
In [296]:
exp_name = f"Baseline_advanced_features_with_bagging"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_simple, model.predict(X_valid_simple))),
                pct(accuracy_score(y_test_simple, model.predict(X_test_simple))),
                roc_auc_score(y_train_simple, model.predict_proba(X_train_simple)[:, 1]),
                roc_auc_score(y_valid_simple, model.predict_proba(X_valid_simple)[:, 1]),
                roc_auc_score(y_test_simple, model.predict_proba(X_test_simple)[:, 1]),
                f1_score(y_train_simple, model.predict(X_train_simple)),
                f1_score(y_test_simple, model.predict(X_test_simple)),
                log_loss(y_train_simple, model.predict(X_train_simple)),
                log_loss(y_test_simple, model.predict(X_test_simple)),0 ],4)) \
                + [train_time,test_time] + [f"Imbalanced bagging with advanced features"]
expLog
Out[296]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features

ROC Curve for Experiment 6¶

In [295]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_simple,model.predict_proba(X_test_simple)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[295]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for experiment 6¶

In [297]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_simple = y_test_simple
y_pred_proba_simple = model.predict_proba(X_test_simple)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_simple = (y_pred_proba_simple > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_simple, y_pred_simple)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 7 - Advanced feature training with Boosting on Imbalanced Dataset¶

In [298]:
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier

newpipeline = Pipeline([
    ("clf", XGBClassifier(n_estimators=100))
])

start = time()
model = newpipeline.fit(X_train_simple, y_train_simple)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_simple , y_train_simple, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_simple, y_test_simple)
test_time = np.round(time() - start, 4)
In [299]:
exp_name = f"Baseline_advanced_features_with_boosting"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_simple, model.predict(X_valid_simple))),
                pct(accuracy_score(y_test_simple, model.predict(X_test_simple))),
                roc_auc_score(y_train_simple, model.predict_proba(X_train_simple)[:, 1]),
                roc_auc_score(y_valid_simple, model.predict_proba(X_valid_simple)[:, 1]),
                roc_auc_score(y_test_simple, model.predict_proba(X_test_simple)[:, 1]),
                f1_score(y_train_simple, model.predict(X_train_simple)),
                f1_score(y_test_simple, model.predict(X_test_simple)),
                log_loss(y_train_simple, model.predict(X_train_simple)),
                log_loss(y_test_simple, model.predict(X_test_simple)),0 ],4)) \
                + [train_time,test_time] + [f"Imbalanced boosting with advanced features"]
expLog
Out[299]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features

ROC Curve for experiment 7¶

In [300]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_simple,model.predict_proba(X_test_simple)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[300]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for Experiment 7¶

In [301]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_simple = y_test_simple
y_pred_proba_simple = model.predict_proba(X_test_simple)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_simple = (y_pred_proba_simple > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_simple, y_pred_simple)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Implementing oversampling with SMOTE¶

image.png

In [432]:
# function to plot and verify the balance of the dataset
def verifyBalance(balanced_dataset):
    colors = ['#ff6666', '#ffcc99', '#99ff99', '#66b3ff']
    fig =plt.figure(figsize=(8,3), tight_layout=True)
    plt.subplot(1, 2, 1)
    balanced_dataset.astype(int).plot.hist(color=colors)
    plt.tick_params(top='off', bottom='off', left='off', right='off')
    plt.xticks([0,1])
    plt.subplot(1, 2, 2)
    balanced_dataset.value_counts().plot(kind='pie', autopct='%1.0f%%', title="Data types", colors=colors)
In [433]:
balanced_dataset = pd.concat([data_app_train[(data_app_train['TARGET']==0)].sample(frac=0.088, random_state=0), data_app_train[(data_app_train['TARGET']==1)]])

verifyBalance(balanced_dataset['TARGET'])
In [434]:
# after varification of the balance split the dataset 
target_balanced_sample = balanced_dataset['TARGET']

balanced_dataset_model_train = balanced_dataset.drop('TARGET', axis=1)

X_train_balanced_sample, X_test_balanced_sample, y_train_balanced_sample, y_test_balanced_sample = train_test_split(balanced_dataset_model_train,target_balanced_sample, test_size=0.4, random_state=0)
In [435]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(data_model_train, target)

verifyBalance(y)
In [436]:
# after varification of the balance split the dataset 
from sklearn.model_selection import StratifiedKFold, train_test_split
X_train_balanced_smote, X_test_balanced_smote, y_train_balanced_smote, y_test_balanced_smote = train_test_split(X,y, test_size=0.4, random_state=42)
X_train_balanced_smote, X_valid_balanced_smote, y_train_balanced_smote, y_valid_balanced_smote = train_test_split(X_train_balanced_smote,y_train_balanced_smote, test_size=0.15, random_state=42)

Experiment 8 - Oversampled LogisticRegression with Advanced Features¶

In [309]:
newpipeline = Pipeline([
    ("clf", LogisticRegression())
])

start = time()
model = newpipeline.fit(X_train_balanced_smote, y_train_balanced_smote)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_balanced_smote , y_train_balanced_smote, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_balanced_smote, y_test_balanced_smote)
test_time = np.round(time() - start, 4)
In [314]:
exp_name = f"Oversampled LogisticRegression_with_advanced_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_balanced_smote, model.predict(X_valid_balanced_smote))),
                pct(accuracy_score(y_test_balanced_smote, model.predict(X_test_balanced_smote))),
                roc_auc_score(y_train_balanced_smote, model.predict_proba(X_train_balanced_smote)[:, 1]),
                roc_auc_score(y_valid_balanced_smote, model.predict_proba(X_valid_balanced_smote)[:, 1]),
                roc_auc_score(y_test_balanced_smote, model.predict_proba(X_test_balanced_smote)[:, 1]),
                f1_score(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                f1_score(y_test_balanced_smote, model.predict(X_test_balanced_smote)),
                log_loss(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                log_loss(y_test_balanced_smote, model.predict(X_test_balanced_smote)),0 ],4)) \
                + [train_time,test_time] + [f"Oversampled LogisticRegression_with_advanced_features"]
expLog
Out[314]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...

ROC Curve for Experiment 8¶

In [315]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_balanced_smote,model.predict_proba(X_test_balanced_smote)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[315]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Experiment 8¶

In [316]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_balanced_smote = y_test_balanced_smote
y_pred_proba_balanced_smote = model.predict_proba(X_test_balanced_smote)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_balanced_smote = (y_pred_proba_balanced_smote > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_balanced_smote, y_pred_balanced_smote)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 9 - Oversampled Decisiontree with Advanced Features¶

In [317]:
newpipeline = Pipeline([
    ("clf", DecisionTreeClassifier())
])

start = time()
model = newpipeline.fit(X_train_balanced_smote, y_train_balanced_smote)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_balanced_smote , y_train_balanced_smote, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_balanced_smote, y_test_balanced_smote)
test_time = np.round(time() - start, 4)
In [318]:
exp_name = f"Oversampled_DecisionTree_with_advanced_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_balanced_smote, model.predict(X_valid_balanced_smote))),
                pct(accuracy_score(y_test_balanced_smote, model.predict(X_test_balanced_smote))),
                roc_auc_score(y_train_balanced_smote, model.predict_proba(X_train_balanced_smote)[:, 1]),
                roc_auc_score(y_valid_balanced_smote, model.predict_proba(X_valid_balanced_smote)[:, 1]),
                roc_auc_score(y_test_balanced_smote, model.predict_proba(X_test_balanced_smote)[:, 1]),
                f1_score(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                f1_score(y_test_balanced_smote, model.predict(X_test_balanced_smote)),
                log_loss(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                log_loss(y_test_balanced_smote, model.predict(X_test_balanced_smote)),0 ],4)) \
                + [train_time,test_time] + [f"Oversampled_DecisionTree_with_advanced_features"]
expLog
Out[318]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features

ROC Curve for Experiment 9¶

In [319]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_balanced_smote,model.predict_proba(X_test_balanced_smote)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[319]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for experiment 9¶

In [320]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_balanced_smote = y_test_balanced_smote
y_pred_proba_balanced_smote = model.predict_proba(X_test_balanced_smote)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_balanced_smote = (y_pred_proba_balanced_smote > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_balanced_smote, y_pred_balanced_smote)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 10 - Oversampled RandomForest with Advanced Features¶

In [321]:
newpipeline = Pipeline([
    ("clf", RandomForestClassifier(n_estimators = 100))
])

start = time()
model = newpipeline.fit(X_train_balanced_smote, y_train_balanced_smote)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_balanced_smote , y_train_balanced_smote, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_balanced_smote, y_test_balanced_smote)
test_time = np.round(time() - start, 4)
In [322]:
exp_name = f"Oversampled_RandomForest_with_advanced_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_balanced_smote, model.predict(X_valid_balanced_smote))),
                pct(accuracy_score(y_test_balanced_smote, model.predict(X_test_balanced_smote))),
                roc_auc_score(y_train_balanced_smote, model.predict_proba(X_train_balanced_smote)[:, 1]),
                roc_auc_score(y_valid_balanced_smote, model.predict_proba(X_valid_balanced_smote)[:, 1]),
                roc_auc_score(y_test_balanced_smote, model.predict_proba(X_test_balanced_smote)[:, 1]),
                f1_score(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                f1_score(y_test_balanced_smote, model.predict(X_test_balanced_smote)),
                log_loss(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                log_loss(y_test_balanced_smote, model.predict(X_test_balanced_smote)),0 ],4)) \
                + [train_time,test_time] + [f"Oversampled_RandomForest_with_advanced_features"]
expLog
Out[322]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features

ROC Curve for Experiment 10¶

In [323]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_balanced_smote,model.predict_proba(X_test_balanced_smote)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[323]:
Text(0, 0.5, 'True Positive Rate')

Confusion matrix for experiment 10¶

In [324]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_balanced_smote = y_test_balanced_smote
y_pred_proba_balanced_smote = model.predict_proba(X_test_balanced_smote)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_balanced_smote = (y_pred_proba_balanced_smote > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_balanced_smote, y_pred_balanced_smote)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 11 - Oversampled Boosting with Advanced Features¶

In [326]:
newpipeline = Pipeline([
    ("clf", XGBClassifier(n_estimators=100))
])

start = time()
model = newpipeline.fit(X_train_balanced_smote, y_train_balanced_smote)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_balanced_smote , y_train_balanced_smote, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_balanced_smote, y_test_balanced_smote)
test_time = np.round(time() - start, 4)
In [327]:
exp_name = f"Oversampled_BaggingClassifier_with_advanced_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_balanced_smote, model.predict(X_valid_balanced_smote))),
                pct(accuracy_score(y_test_balanced_smote, model.predict(X_test_balanced_smote))),
                roc_auc_score(y_train_balanced_smote, model.predict_proba(X_train_balanced_smote)[:, 1]),
                roc_auc_score(y_valid_balanced_smote, model.predict_proba(X_valid_balanced_smote)[:, 1]),
                roc_auc_score(y_test_balanced_smote, model.predict_proba(X_test_balanced_smote)[:, 1]),
                f1_score(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                f1_score(y_test_balanced_smote, model.predict(X_test_balanced_smote)),
                log_loss(y_train_balanced_smote, model.predict(X_train_balanced_smote)),
                log_loss(y_test_balanced_smote, model.predict(X_test_balanced_smote)),0 ],4)) \
                + [train_time,test_time] + [f"Oversampled_BaggingClassifier_with_advanced_features"]
expLog
Out[327]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...

ROC Curve for Experiment 11¶

In [328]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_balanced_smote,model.predict_proba(X_test_balanced_smote)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[328]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Exeriment 11¶

In [329]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_balanced_smote = y_test_balanced_smote
y_pred_proba_balanced_smote = model.predict_proba(X_test_balanced_smote)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_balanced_smote = (y_pred_proba_balanced_smote > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_balanced_smote, y_pred_balanced_smote)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Checking the models with the Domain variables inclusion.¶

Combining the polynomial variables and domain variables to check the performance, check with the high performing models, random forest, Decision tree and bagging.

image.png image-2.png

In [437]:
oversample_poly = SMOTE()

imputer = SimpleImputer(strategy="median")
# adding the domain variable also to the poly feature dataset
data_app_train_poly['CREDIT_INCOME_PERCENT'] = app_train_domain['CREDIT_INCOME_PERCENT']
data_app_train_poly['ANNUITY_INCOME_PERCENT'] = app_train_domain['ANNUITY_INCOME_PERCENT']
data_app_train_poly['CREDIT_TERM'] = app_train_domain['CREDIT_TERM']
data_app_train_poly['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED_PERCENT']
# treating the missing values
data_app_train_poly = imputer.fit_transform(data_app_train_poly)
# Oversampling
X_poly, y_poly = oversample_poly.fit_resample(data_app_train_poly, target)
print("shape of the new datset" , data_app_train_poly.shape)

verifyBalance(y_poly)

# splitting the dataset
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y_poly, test_size=0.4, random_state=42)
X_train_poly, X_valid_poly, y_train_poly, y_valid_poly = train_test_split(X_train_poly, y_train_poly, test_size=0.15, random_state=42)
shape of the new datset (307511, 277)

Experiment 12 - Decisontree with Polynomial Features + DomainFeatures¶

In [331]:
newpipeline = Pipeline([
    ("clf", DecisionTreeClassifier())
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)
In [332]:
exp_name = f"Decisontree with Polynomial Features + DomainFeatures"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Decisontree with Polynomial Features + DomainFeatures"]
expLog
Out[332]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...

ROC Curve for experiment 12¶

In [333]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[333]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Experiment 12¶

In [334]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 13 - RandomForest with Polynomial Features + DomainFeatures¶

In [335]:
newpipeline = Pipeline([
    ("clf", RandomForestClassifier(n_estimators = 100))
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)
In [336]:
exp_name = f"RandomForest with Polynomial Features + DomainFeatures"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"RandomForest with Polynomial Features + DomainFeatures"]
expLog
Out[336]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...

ROC Curve for experiment 13¶

In [337]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[337]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Experiment 13¶

In [338]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 14 - Xgboost with Polynomial Features + DomainFeatures¶

In [339]:
newpipeline = Pipeline([
    ("clf", XGBClassifier(n_estimators=100))
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)
In [340]:
exp_name = f"Boosting with Polynomial Features + DomainFeatures"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Boosting with Polynomial Features + DomainFeatures"]
expLog
Out[340]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...
13 Boosting with Polynomial Features + DomainFeat... 95.516 95.559 95.560 0.9883 0.9783 0.9781 0.9575 0.9537 1.4719 1.6003 0.0 186.6716 0.3636 Boosting with Polynomial Features + DomainFeat...

Roc Curve for Experiment 14¶

In [341]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
Out[341]:
Text(0, 0.5, 'True Positive Rate')

Confusion Matrix for Experiment 14¶

In [342]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Observation 11¶

After observing that XGBoost, RandomForest, and Decision Tree models have demonstrated superior performance in our analysis, we are now poised to refine our modeling approach. In order to streamline and enhance our models, we will incorporate a feature selection step using the SelectKBest method. This method allows us to carefully curate a subset of the most impactful features, enabling a more focused and interpretable model.

The rationale behind employing feature selection lies in its potential to improve model efficiency, reduce overfitting, and enhance interpretability. By narrowing down our feature set to the most relevant ones, we aim to boost the overall performance of our models and gain insights into the key factors influencing predictive accuracy.

Our next step involves applying the SelectKBest method to each of the selected algorithms – XGBoost, RandomForest, and Decision Tree. This iterative approach allows us to tailor the feature selection process to the specific characteristics and requirements of each model.

Upon completion of this refined modeling process, we will conduct a comprehensive evaluation of the models' performance metrics, assess the importance of selected features, and draw comparisons with the initial models. This holistic analysis will contribute to a deeper understanding of the interplay between feature selection and algorithmic performance, guiding us towards more informed and effective modeling decisions.

Selecting best features using SelectKbest¶

Experiment 15 - Selecting Best features with Xgboost¶

In [344]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score

newpipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=30)),
    ("clf", XGBClassifier(n_estimators=100))
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)
In [346]:
exp_name = f"Kbest Features with Polynomial Features + DomainFeatures with Xgboost"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Kbest Features with Polynomial Features + DomainFeatures with Xgboost"]
expLog
Out[346]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...
13 Boosting with Polynomial Features + DomainFeat... 95.516 95.559 95.560 0.9883 0.9783 0.9781 0.9575 0.9537 1.4719 1.6003 0.0 186.6716 0.3636 Boosting with Polynomial Features + DomainFeat...
14 Kbest Features with Polynomial Features + Doma... 91.565 91.866 91.847 0.9688 0.9630 0.9618 0.9217 0.9129 2.6465 2.9385 0.0 65.2787 0.2772 Kbest Features with Polynomial Features + Doma...

ROC Curve and Confusion matrix for Experiment 14¶

In [347]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 16 - Selecting Best features with Decisiontree¶

In [349]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score

newpipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=30)),
    ("clf", DecisionTreeClassifier())
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)


exp_name = f"Kbest Features with Polynomial Features + DomainFeatures with Decisiontree"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Kbest Features with Polynomial Features + DomainFeatures with DecisionTree"]
expLog
Out[349]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...
13 Boosting with Polynomial Features + DomainFeat... 95.516 95.559 95.560 0.9883 0.9783 0.9781 0.9575 0.9537 1.4719 1.6003 0.0 186.6716 0.3636 Boosting with Polynomial Features + DomainFeat...
14 Kbest Features with Polynomial Features + Doma... 91.565 91.866 91.847 0.9688 0.9630 0.9618 0.9217 0.9129 2.6465 2.9385 0.0 65.2787 0.2772 Kbest Features with Polynomial Features + Doma...
15 Kbest Features with Polynomial Features + Doma... 81.843 83.325 83.250 1.0000 0.8333 0.8325 1.0000 0.8323 0.0000 6.0373 0.0 113.8591 0.2591 Kbest Features with Polynomial Features + Doma...

RoC Curve and Confusion Marix for Experiment 16¶

In [350]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Experiment 16 - Selecting Best features with RandomForest¶

In [352]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score

newpipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=30)),
    ("clf", RandomForestClassifier(n_jobs=4))
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)


exp_name = f"Kbest Features with Polynomial Features + DomainFeatures with RandomForest"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Kbest Features with Polynomial Features + DomainFeatures with RandomForest"]
expLog
Out[352]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...
13 Boosting with Polynomial Features + DomainFeat... 95.516 95.559 95.560 0.9883 0.9783 0.9781 0.9575 0.9537 1.4719 1.6003 0.0 186.6716 0.3636 Boosting with Polynomial Features + DomainFeat...
14 Kbest Features with Polynomial Features + Doma... 91.565 91.866 91.847 0.9688 0.9630 0.9618 0.9217 0.9129 2.6465 2.9385 0.0 65.2787 0.2772 Kbest Features with Polynomial Features + Doma...
15 Kbest Features with Polynomial Features + Doma... 81.843 83.325 83.250 1.0000 0.8333 0.8325 1.0000 0.8323 0.0000 6.0373 0.0 113.8591 0.2591 Kbest Features with Polynomial Features + Doma...
16 Kbest Features with Polynomial Features + Doma... 85.676 87.104 87.128 1.0000 0.9420 0.9416 1.0000 0.8685 0.0005 4.6397 0.0 480.8479 2.9291 Kbest Features with Polynomial Features + Doma...

ROC Curve and ConfusinMatrix with RandomForest¶

In [353]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Future Phase Scope¶

Hyperparameter tuning involves finding the optimal set of hyperparameters for a machine learning model to improve its performance. When using the SelectKBest method, the hyperparameter to be tuned is 'k,' which represents the number of features selected. Here's some information on hyperparameter tuning for the best 'k':

Importance of Tuning 'k':¶

  1. Impact on Model Performance:

    • The choice of 'k' directly affects the number of features used by the model. Too few features may result in underfitting, while too many may lead to overfitting. Tuning 'k' helps strike a balance for better model generalization.
  2. Computational Efficiency:

    • Selecting an optimal 'k' can improve computational efficiency by reducing the dimensionality of the dataset. This is crucial, especially when dealing with high-dimensional data.
  3. Interpretability:

    • A smaller set of features enhances the interpretability of the model. Tuning 'k' allows for the identification of the most influential features, aiding in better understanding and explaining model predictions.

Strategies for Tuning 'k':¶

  1. Grid Search:

    • Perform a grid search over a range of possible 'k' values, evaluating the model's performance for each. This exhaustive search helps identify the 'k' that maximizes the chosen performance metric.

      from sklearn.model_selection import GridSearchCV
      
      param_grid = {'feature_selection__k': [5, 10, 15, 20]}  # Adjust the range
      grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring='accuracy', cv=5)
      grid_search.fit(X_train, y_train)
      best_k = grid_search.best_params_['feature_selection__k']
      
  2. Random Search:

    • Randomly sample 'k' values from a predefined range. This approach can be more efficient than grid search and is especially beneficial when the search space is large.

      from sklearn.model_selection import RandomizedSearchCV
      
      param_dist = {'feature_selection__k': [5, 10, 15, 20]}  # Adjust the range
      random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=3, scoring='accuracy', cv=5)
      random_search.fit(X_train, y_train)
      best_k = random_search.best_params_['feature_selection__k']
      

Evaluation Metrics:¶

  • Choose an appropriate evaluation metric (e.g., accuracy, precision, recall, F1-score) to assess the impact of different 'k' values on model performance.
from sklearn.metrics import accuracy_score

# Example of evaluating accuracy for a specific 'k'
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=best_k)),
    ('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy for k={best_k}: {accuracy:.2f}")

Tuning the 'k' hyperparameter in SelectKBest is crucial for optimizing your model's performance, and the choice should be guided by a thorough search across a range of values using cross-validation. The ultimate goal is to find the 'k' that balances model complexity, interpretability, and predictive accuracy.

HyperParameter Tuning BestPerforming Algorithms¶

image.png

Hyperparameter Tuning XgBoost Classifier¶

In [360]:
from sklearn.model_selection import GridSearchCV

# Resetting the index of the DataFrame
X_train_balanced_smote_reset = X_train_balanced_smote.reset_index(drop=True)

# Example: Randomly sample 50% of the data
sampled_indices = np.random.choice(len(X_train_balanced_smote_reset), size=int(0.5 * len(X_train_balanced_smote_reset)), replace=False)
X_train_sampled = X_train_balanced_smote_reset.loc[sampled_indices]
y_train_sampled = y_train_balanced_smote.iloc[sampled_indices]

from xgboost import XGBClassifier
# deciding the parameters for tuning
parameters = {
    'n_estimators': [300, 400],
    'learning_rate': [0.1, 0.05]
}

grid_search_boost = GridSearchCV(
    estimator=XGBClassifier(objective= 'binary:logistic'),
    param_grid=parameters,
    scoring = 'recall',
    cv = 3,
    verbose=True,
    n_jobs = 3
)

grid_search_boost.fit(X_train_balanced_smote, y_train_balanced_smote)

print("Best EStimatiors : ", grid_search_boost.best_estimator_)

print("Best score : ", grid_search_boost.best_score_)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best EStimatiors :  XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=400, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
Best score :  0.9131333808839212

Observation 12¶

Fitting Models with Best Parameters¶

The provided information pertains to the results of a hyperparameter tuning process using cross-validation with 3 folds for each of the 4 candidate models. The best estimator identified is an XGBClassifier with a set of hyperparameters. Here are some key points to note:

  1. Best Estimator:

    • Model: XGBClassifier
    • Hyperparameters:
      • learning_rate: 0.1
      • n_estimators: 400
      • Other hyperparameters are not explicitly listed but are part of the best estimator.
  2. Best Score:

    • The best score achieved by the model is 0.9131.
  3. Observations:

    • The hyperparameter tuning process involved evaluating different combinations of hyperparameters for the XGBClassifier using cross-validation.
    • The identified best model achieved a high score of 0.9131, indicating strong predictive performance.
    • The hyperparameters selected, such as the learning rate and number of estimators, are crucial for the model's performance and were determined through the tuning process.
    • The use of 3 folds in cross-validation suggests that the model's performance was assessed by splitting the dataset into three subsets, training the model on two-thirds of the data, and validating on the remaining one-third, repeated for different splits.

Fitting Xgboost with best parameters¶

In [361]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score

newpipeline = Pipeline([
    ("clf", XGBClassifier(n_estimators=400, objective= 'binary:logistic', learning_rate=0.1, max_depth=10))
])

start = time()
model = newpipeline.fit(X_train_poly, y_train_poly)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_val_score(newpipeline, X_train_poly , y_train_poly, cv=3)               
logit_score_train = pct(logit_scores.mean())
train_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = newpipeline.score(X_test_poly, y_test_poly)
test_time = np.round(time() - start, 4)


exp_name = f"XgBoost with best Hyperparameters"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_score_train, 
                pct(accuracy_score(y_valid_poly, model.predict(X_valid_poly))),
                pct(accuracy_score(y_test_poly, model.predict(X_test_poly))),
                roc_auc_score(y_train_poly, model.predict_proba(X_train_poly)[:, 1]),
                roc_auc_score(y_valid_poly, model.predict_proba(X_valid_poly)[:, 1]),
                roc_auc_score(y_test_poly, model.predict_proba(X_test_poly)[:, 1]),
                f1_score(y_train_poly, model.predict(X_train_poly)),
                f1_score(y_test_poly, model.predict(X_test_poly)),
                log_loss(y_train_poly, model.predict(X_train_poly)),
                log_loss(y_test_poly, model.predict(X_test_poly)),0 ],4)) \
                + [train_time,test_time] + [f"Best parameters n_estimators=400, learningrate=0.1"]
expLog
Out[361]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Test F1 Score Train Log Loss Test Log Loss P Score Train Time Test Time Description
0 Baseline_38_features 91.809 91.880 91.815 0.7412 0.7362 0.7370 0.0139 0.0133 2.9319 2.9501 0.0 21.4906 0.1135 Imbalanced Logistic reg with 20% training data
1 Baseline_38_features 71.724 84.586 84.658 0.7433 0.7367 0.7365 0.4790 0.2823 10.0659 5.5299 0.0 4.5632 0.1214 Balanced Logistic reg with 30% training data
2 Baseline_advanced_features 91.875 91.855 92.022 0.7463 0.7474 0.7460 0.0220 0.0187 2.9258 2.8756 0.0 173.1410 0.0576 experiment 3 -> Imbalanced Logistic reg with ...
3 Baseline_advanced_features_with_DT 85.088 85.178 85.378 1.0000 0.5393 0.5377 1.0000 0.1498 0.0000 5.2702 0.0 94.3183 0.1299 experiment 3 -> Imbalanced DecisionTree with ...
4 Baseline_advanced_features_with_randmforest 91.881 91.861 92.045 1.0000 0.7178 0.7116 0.9998 0.0016 0.0014 2.8673 0.0 424.2478 5.6705 Imbalanced randomforest with advanced features
5 Baseline_advanced_features_with_bagging 91.855 91.889 91.986 1.0000 0.6935 0.6904 0.9948 0.0307 0.0301 2.8884 0.0 3266.9368 7.8258 Imbalanced bagging with advanced features
6 Baseline_advanced_features_with_boosting 91.823 91.846 92.000 0.8621 0.7506 0.7468 0.1608 0.0611 2.7114 2.8834 0.0 57.8918 0.1648 Imbalanced boosting with advanced features
7 Oversampled LogisticRegression_with_advanced_f... 70.673 71.060 70.587 0.7745 0.7777 0.7732 0.7091 0.7083 10.5552 10.6016 0.0 33.6503 0.2514 Oversampled LogisticRegression_with_advanced_f...
8 Oversampled_DecisionTree_with_advanced_features 88.913 89.250 89.282 1.0000 0.8925 0.8928 1.0000 0.8938 0.0000 3.8632 0.0 111.9326 0.4028 Oversampled_DecisionTree_with_advanced_features
9 Oversampled_RandomForest_with_advanced_features 94.578 95.048 95.144 1.0000 0.9827 0.9824 1.0000 0.9493 0.0000 1.7503 0.0 765.1550 11.8012 Oversampled_RandomForest_with_advanced_features
10 Oversampled_BaggingClassifier_with_advanced_fe... 95.432 95.458 95.500 0.9856 0.9773 0.9774 0.9563 0.9531 1.5117 1.6219 0.0 114.6474 0.6990 Oversampled_BaggingClassifier_with_advanced_fe...
11 Decisontree with Polynomial Features + DomainF... 90.581 90.940 90.967 1.0000 0.9094 0.9097 1.0000 0.9103 0.0000 3.2560 0.0 215.9402 0.2848 Decisontree with Polynomial Features + DomainF...
12 RandomForest with Polynomial Features + Domain... 95.468 95.543 95.572 1.0000 0.9796 0.9793 1.0000 0.9537 0.0003 1.5959 0.0 1250.2992 10.2486 RandomForest with Polynomial Features + Domain...
13 Boosting with Polynomial Features + DomainFeat... 95.516 95.559 95.560 0.9883 0.9783 0.9781 0.9575 0.9537 1.4719 1.6003 0.0 186.6716 0.3636 Boosting with Polynomial Features + DomainFeat...
14 Kbest Features with Polynomial Features + Doma... 91.565 91.866 91.847 0.9688 0.9630 0.9618 0.9217 0.9129 2.6465 2.9385 0.0 65.2787 0.2772 Kbest Features with Polynomial Features + Doma...
15 Kbest Features with Polynomial Features + Doma... 81.843 83.325 83.250 1.0000 0.8333 0.8325 1.0000 0.8323 0.0000 6.0373 0.0 113.8591 0.2591 Kbest Features with Polynomial Features + Doma...
16 Kbest Features with Polynomial Features + Doma... 85.676 87.104 87.128 1.0000 0.9420 0.9416 1.0000 0.8685 0.0005 4.6397 0.0 480.8479 2.9291 Kbest Features with Polynomial Features + Doma...
17 XgBoost with best Hyperparameters 95.568 95.610 95.616 0.9998 0.9785 0.9783 0.9832 0.9543 0.5954 1.5801 0.0 1263.1159 0.8522 Best parameters n_estimators=400, learningrate...
In [363]:
from sklearn.metrics import roc_curve, auc
fpr , tpr , thresholds   = roc_curve(y_test_poly,model.predict_proba(X_test_poly)[:, 1])
fig = plt.figure(figsize=(10,8))
ax  = fig.add_subplot(111)
ax.plot(fpr,tpr,label   = ["Area under curve : ",auc(fpr,tpr)],linewidth=2,linestyle="dotted")
ax.plot([0,1],[0,1],linewidth=2,linestyle="dashed")
plt.legend(loc="best")
plt.title("ROC-CURVE & AREA UNDER CURVE")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test is your true labels and model is your trained classifier
y_true_poly = y_test_poly
y_pred_proba_poly = model.predict_proba(X_test_poly)[:, 1]  # Probabilities for the positive class

# Convert probabilities to binary predictions (using a threshold, e.g., 0.5)
y_pred_poly = (y_pred_proba_poly > 0.5).astype(int)

# Create confusion matrix
cm = confusion_matrix(y_true_poly, y_pred_poly)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Interpretation of ROC Curve¶

The image you sent is a ROC curve, which is a graphical method for evaluating the performance of a binary classifier. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The TPR measures the proportion of positive examples that are correctly classified, while the FPR measures the proportion of negative examples that are incorrectly classified.

In the context of fitting XGBoost with the best parameters, the ROC curve can be used to select the optimal classification threshold. The optimal classification threshold is the threshold that maximizes the TPR while minimizing the FPR. This can be achieved by finding the point on the ROC curve that is closest to the top left corner.

In the image you sent, the area under the curve (AUC) is 0.9782. This indicates that the XGBoost model is performing very well at distinguishing between positive and negative examples.

To fit XGBoost with the best parameters, you can use a variety of methods, such as grid search, random search, or Bayesian optimization. Once you have found a set of parameters that produces a good AUC on the training data, you can evaluate the model on the test data to see how well it generalizes to unseen data.

Here are some specific tips for fitting XGBoost with the best parameters:

Start with a small set of parameters and gradually increase the number of parameters as needed. Use a cross-validation scheme to evaluate the performance of the model on different subsets of the training data. Use a regularization technique such as L1 or L2 regularization to prevent the model from overfitting the training data. Use a learning rate scheduler to gradually decrease the learning rate as the model trains. Once you have found a set of parameters that produce a good AUC on the test data, you can use those parameters to train your final model.


Neural Networks for Home Credit Default Risk Prediction¶

This project explores the application of neural networks in predicting credit default risk for home loans. Home Credit Default Risk (HCDR) is a critical concern for financial institutions, and accurate prediction models can aid in making informed lending decisions. Traditional credit scoring models often fall short in capturing complex patterns within diverse datasets.

In this study, we leverage the power of neural networks, specifically deep learning architectures, to enhance the accuracy of credit risk assessment. We employ a dataset from Home Credit, consisting of various socio-economic and financial features. The neural network model is designed to automatically learn intricate relationships and dependencies within the data, allowing for more robust risk predictions.

image.png

The project includes the following key components:

  1. Data Preprocessing: Cleaning and feature engineering to prepare the dataset for neural network training.

  2. Neural Network Architecture: Designing a deep learning model tailored for credit risk prediction, with appropriate layers, activation functions, and optimization algorithms.

  3. Training and Validation: Utilizing historical data to train the neural network and validating its performance on a separate dataset to ensure generalization.

  4. Evaluation Metrics: Employing standard metrics such as accuracy, precision, recall, and the area under the ROC curve to assess the model's effectiveness.

  5. Interpretability: Exploring methods to interpret the neural network's decisions, providing insights into the factors contributing to credit default risk.

The outcomes of this project aim to contribute to the development of more sophisticated and accurate credit risk models, potentially improving the decision-making processes for financial institutions in the context of home lending.


Importing previously feature engineered Data¶

In [337]:
DATA_DIR='C:/Users/tanub/Courses/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2'
In [338]:
%%time
ds_names = ('appsTrainDF', 'X_kaggle_test')

for ds_name in ds_names:
    datasets[ds_name]= load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
appsTrainDF: shape is (307511, 705)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 705 entries, SK_ID_CURR to HAS_LIBAILITY_3
dtypes: bool(43), float64(623), int64(39)
memory usage: 1.5 GB
None
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No HAS_LIBAILITY_0 HAS_LIBAILITY_1 HAS_LIBAILITY_2 HAS_LIBAILITY_3
0 100002 1 0 202500.0 406597.5 24700.5 0.018801 -9461 -637 -3648.0 ... False False False True False True False True False False
1 100003 0 0 270000.0 1293502.5 35698.5 0.003541 -16765 -1188 -1186.0 ... False False False False False True False False False True
2 100004 0 0 67500.0 135000.0 6750.0 0.010032 -19046 -225 -4260.0 ... False False True False False True True False False False
3 100006 0 0 135000.0 312682.5 29686.5 0.008019 -19005 -3039 -9833.0 ... False False True False False True False True False False
4 100007 0 0 121500.0 513000.0 21865.5 0.028663 -19932 -3038 -4311.0 ... False False True False False True False True False False

5 rows × 705 columns

X_kaggle_test: shape is (48744, 704)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 704 entries, SK_ID_CURR to HAS_LIBAILITY_3
dtypes: bool(43), float64(623), int64(38)
memory usage: 247.8 MB
None
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH ... WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No HAS_LIBAILITY_0 HAS_LIBAILITY_1 HAS_LIBAILITY_2 HAS_LIBAILITY_3
0 100001 0 135000.0 568800.0 20560.5 0.018850 -19241 -2329 -5170.0 -812 ... False False False True False True False True False False
1 100005 0 99000.0 222768.0 17370.0 0.035792 -18064 -4469 -9118.0 -1623 ... False False True False False True False True False False
2 100013 0 202500.0 663264.0 69777.0 0.019101 -20038 -4458 -2175.0 -3503 ... False False True False False True True False False False
3 100028 2 315000.0 1575000.0 49018.5 0.026392 -13976 -1866 -2000.0 -4208 ... False False True False False True False True False False
4 100038 1 180000.0 625500.0 32067.0 0.010032 -13040 -2191 -4000.0 -4262 ... False False True False False True False False True False

5 rows × 704 columns

CPU times: total: 13.1 s
Wall time: 14.7 s

Preparing data for Submission and Training¶

In [339]:
X_kaggle_test=datasets['X_kaggle_test']
appsTrainDF=datasets['appsTrainDF']
In [340]:
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]
In [341]:
# Create a class to select numerical or categorical columns since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
In [342]:
num_attribs=['EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'EXT_SOURCE_1',
 'OCCUPATION_TYPE_Office',
 'previous_application_NAME_CONTRACT_STATUS_Approved_mean',
 'NAME_EDUCATION_TYPE_Higher education',
 'CODE_GENDER_F',
 'previous_application_DAYS_FIRST_DRAWING_mean',
 'DAYS_EMPLOYED',
 'previous_application_DAYS_FIRST_DRAWING_min',
 'FLOORSMAX_AVG',
 'previous_application_RATE_DOWN_PAYMENT_sum',
 'previous_application_NAME_YIELD_GROUP_low_normal_mean',
 'previous_application_RATE_DOWN_PAYMENT_max',
 'previous_application_INTEREST_RT_sum',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_mean',
 'REGION_POPULATION_RELATIVE',
 'previous_application_INTEREST_RT_mean',
 'previous_application_HOUR_APPR_PROCESS_START_mean',
 'previous_application_AMT_ANNUITY_mean',
 'previous_application_NAME_PAYMENT_TYPE_Cash through the bank_mean',
 'ELEVATORS_AVG',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_mean',
 'previous_application_RATE_DOWN_PAYMENT_mean',
 'previous_application_NAME_CONTRACT_TYPE_Consumer loans_mean',
 'previous_application_AMT_ANNUITY_min',
 'previous_application_DAYS_FIRST_DRAWING_count',
 'previous_application_HOUR_APPR_PROCESS_START_min',
 'previous_application_HOUR_APPR_PROCESS_START_max',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_sum',
 'AMT_CREDIT',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_mean',
 'APARTMENTS_AVG',
 'previous_application_NAME_YIELD_GROUP_low_action_mean',
 'previous_application_AMT_ANNUITY_max',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_sum',
 'FLAG_DOCUMENT_6',
 'NAME_HOUSING_TYPE_House / apartment',
 'previous_application_NAME_YIELD_GROUP_low_normal_sum',
 'previous_application_CREDIT_SUCCESS_sum',
 'previous_application_NAME_CLIENT_TYPE_Refreshed_mean',
 'bureau_CREDIT_TYPE_Consumer credit_mean',
 'previous_application_AMT_DOWN_PAYMENT_max',
 'previous_application_NAME_YIELD_GROUP_low_action_sum',
 'HOUR_APPR_PROCESS_START',
 'FLAG_PHONE',
 'previous_application_AMT_DOWN_PAYMENT_count',
 'NAME_INCOME_TYPE_State servant',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_sum',
 'previous_application_INTEREST_PER_CREDIT_min',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum',
 'bureau_CREDIT_TYPE_Credit card_sum',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'previous_application_NAME_YIELD_GROUP_high_sum',
 'previous_application_NAME_YIELD_GROUP_high_mean',
 'previous_application_NAME_PAYMENT_TYPE_XNA_sum',
 'previous_application_CODE_REJECT_REASON_LIMIT_mean',
 'previous_application_PRODUCT_COMBINATION_Card Street_mean',
 'previous_application_CODE_REJECT_REASON_LIMIT_sum',
 'DAYS_REGISTRATION',
 'bureau_DAYS_CREDIT_sum',
 'previous_application_NAME_YIELD_GROUP_XNA_mean',
 'bureau_DAYS_CREDIT_UPDATE_min',
 'FLAG_DOCUMENT_3',
 'REG_CITY_NOT_LIVE_CITY',
 'bureau_CREDIT_TYPE_Microloan_mean',
 'previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum',
 'previous_application_NAME_CLIENT_TYPE_New_sum',
 'previous_application_DAYS_DECISION_mean',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'previous_application_CODE_REJECT_REASON_HC_sum',
 'previous_application_PRODUCT_COMBINATION_Card Street_sum',
 'bureau_DAYS_CREDIT_max',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'REG_CITY_NOT_WORK_CITY',
 'DAYS_ID_PUBLISH',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'previous_application_DAYS_DECISION_min',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'previous_application_CODE_REJECT_REASON_HC_mean',
 'DAYS_LAST_PHONE_CHANGE',
 'previous_application_CODE_REJECT_REASON_SCOFR_mean',
 'bureau_DAYS_ENDDATE_FACT_min',
 'previous_application_CODE_REJECT_REASON_SCOFR_sum',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_mean',
 'NAME_INCOME_TYPE_Working',
 'REGION_RATING_CLIENT',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_sum',
 'previous_application_NAME_CONTRACT_STATUS_Refused_sum',
 'bureau_CREDIT_ACTIVE_Active_sum',
 'bureau_DAYS_CREDIT_UPDATE_mean',
 'previous_application_INTEREST_PER_CREDIT_max',
 'bureau_DAYS_CREDIT_min',
 'bureau_CREDIT_ACTIVE_Active_mean',
 'previous_application_NAME_CONTRACT_STATUS_Refused_mean',
 'DAYS_BIRTH',
 'bureau_DAYS_CREDIT_mean',
 'previous_application_INTEREST_PER_CREDIT_mean',
'previous_application_CREDIT_SUCCESS_mean',
'previous_application_INTEREST_RT_mean',
'HAS_LIBAILITY_0',
'HAS_LIBAILITY_1',
'HAS_LIBAILITY_2',
'HAS_LIBAILITY_3',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
  'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR'
]

Num Pipeline¶

In [343]:
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])
In [344]:
cat_attribs =[]

Cat Pipeline¶

In [345]:
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
        #('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        #('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
In [346]:
data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
    #    ("cat_pipeline", cat_pipeline),
    ])      
In [347]:
selected_features = num_attribs 
tot_features = f"{len(selected_features)}:   Num:{len(num_attribs)},    Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
Out[347]:
'132:   Num:132,    Cat:0'
In [348]:
gc.collect()
Out[348]:
489
In [359]:
for col in selected_features:
  if col not in  train_dataset.columns:
    selected_features.remove(col)

Splitting into train test Validation¶

In [360]:
# Split Sample to feed the pipeline and it will result in a new dataset that is (1 / splits) the size 
splits = 75

# Train Test split percentage
subsample_rate = 0.3

finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]

## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)

print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
print(f"X kaggle_test     shape: {X_kaggle_test.shape}")
X train           shape: (2439, 132)
X validation      shape: (431, 132)
X test            shape: (1231, 132)
X kaggle_test     shape: (48744, 132)
In [351]:
# roc curve, precision recall curve for each model
fprs, tprs, precisions, recalls, names, scores, cvscores, pvalues, accuracy, cnfmatrix = list(), list(), list(), list(), list(), list(), list(), list(), list(), list()
features_list, final_best_clf,results = {}, {},[]

Installing Tensor Flow¶

In [326]:
pip install tensorflow
Collecting tensorflow
  Obtaining dependency information for tensorflow from https://files.pythonhosted.org/packages/93/21/9b035a4f823d6aee2917c75415be9a95861ff3d73a0a65e48edbf210cec1/tensorflow-2.15.0-cp311-cp311-win_amd64.whl.metadata
  Downloading tensorflow-2.15.0-cp311-cp311-win_amd64.whl.metadata (3.6 kB)
Collecting tensorflow-intel==2.15.0 (from tensorflow)
  Obtaining dependency information for tensorflow-intel==2.15.0 from https://files.pythonhosted.org/packages/4c/48/1a5a15517f18eaa4ff8d598b1c000300b20c1bb0e624539d702117a0c369/tensorflow_intel-2.15.0-cp311-cp311-win_amd64.whl.metadata
  Downloading tensorflow_intel-2.15.0-cp311-cp311-win_amd64.whl.metadata (5.1 kB)
Collecting absl-py>=1.0.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for absl-py>=1.0.0 from https://files.pythonhosted.org/packages/01/e4/dc0a1dcc4e74e08d7abedab278c795eef54a224363bb18f5692f416d834f/absl_py-2.0.0-py3-none-any.whl.metadata
  Downloading absl_py-2.0.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for flatbuffers>=23.5.26 from https://files.pythonhosted.org/packages/6f/12/d5c79ee252793ffe845d58a913197bfa02ae9a0b5c9bc3dc4b58d477b9e7/flatbuffers-23.5.26-py2.py3-none-any.whl.metadata
  Downloading flatbuffers-23.5.26-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl (19 kB)
Collecting google-pasta>=0.1.1 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
     ---------------------------------------- 0.0/57.5 kB ? eta -:--:--
     ---------------------------------------- 57.5/57.5 kB 1.5 MB/s eta 0:00:00
Requirement already satisfied: h5py>=2.9.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (3.9.0)
Collecting libclang>=13.0.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for libclang>=13.0.0 from https://files.pythonhosted.org/packages/02/8c/dc970bc00867fe290e8c8a7befa1635af716a9ebdfe3fb9dce0ca4b522ce/libclang-16.0.6-py2.py3-none-win_amd64.whl.metadata
  Downloading libclang-16.0.6-py2.py3-none-win_amd64.whl.metadata (5.3 kB)
Collecting ml-dtypes~=0.2.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for ml-dtypes~=0.2.0 from https://files.pythonhosted.org/packages/08/89/c727fde1a3d12586e0b8c01abf53754707d76beaa9987640e70807d4545f/ml_dtypes-0.2.0-cp311-cp311-win_amd64.whl.metadata
  Downloading ml_dtypes-0.2.0-cp311-cp311-win_amd64.whl.metadata (20 kB)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.24.3)
Collecting opt-einsum>=2.3.2 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
     ---------------------------------------- 0.0/65.5 kB ? eta -:--:--
     ---------------------------------------- 65.5/65.5 kB 3.5 MB/s eta 0:00:00
Requirement already satisfied: packaging in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (23.1)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 from https://files.pythonhosted.org/packages/fe/6b/7f177e8d6fe4caa14f4065433af9f879d4fab84f0d17dcba7b407f6bd808/protobuf-4.25.1-cp310-abi3-win_amd64.whl.metadata
  Downloading protobuf-4.25.1-cp310-abi3-win_amd64.whl.metadata (541 bytes)
Requirement already satisfied: setuptools in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (68.0.0)
Requirement already satisfied: six>=1.12.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.16.0)
Collecting termcolor>=1.1.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for termcolor>=1.1.0 from https://files.pythonhosted.org/packages/d9/5f/8c716e47b3a50cbd7c146f45881e11d9414def768b7cd9c5e6650ec2a80a/termcolor-2.4.0-py3-none-any.whl.metadata
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Requirement already satisfied: typing-extensions>=3.6.6 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (4.7.1)
Requirement already satisfied: wrapt<1.15,>=1.11.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.14.1)
Collecting tensorflow-io-gcs-filesystem>=0.23.1 (from tensorflow-intel==2.15.0->tensorflow)
  Downloading tensorflow_io_gcs_filesystem-0.31.0-cp311-cp311-win_amd64.whl (1.5 MB)
     ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
     --- ------------------------------------ 0.1/1.5 MB 7.0 MB/s eta 0:00:01
     ----------- ---------------------------- 0.4/1.5 MB 5.1 MB/s eta 0:00:01
     --------------------------- ------------ 1.0/1.5 MB 8.1 MB/s eta 0:00:01
     ---------------------------------------- 1.5/1.5 MB 9.4 MB/s eta 0:00:00
Collecting grpcio<2.0,>=1.24.3 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for grpcio<2.0,>=1.24.3 from https://files.pythonhosted.org/packages/bc/e5/f656b17fe1ccda1e2a4fe20298b8bcf7c804561c90ee763e39efc1c3772f/grpcio-1.59.3-cp311-cp311-win_amd64.whl.metadata
  Downloading grpcio-1.59.3-cp311-cp311-win_amd64.whl.metadata (4.2 kB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for tensorboard<2.16,>=2.15 from https://files.pythonhosted.org/packages/6e/0c/1059a6682cf2cc1fcc0d5327837b5672fe4f5574255fa5430d0a8ceb75e9/tensorboard-2.15.1-py3-none-any.whl.metadata
  Downloading tensorboard-2.15.1-py3-none-any.whl.metadata (1.7 kB)
Collecting tensorflow-estimator<2.16,>=2.15.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for tensorflow-estimator<2.16,>=2.15.0 from https://files.pythonhosted.org/packages/b6/c8/2f823c8958d5342eafc6dd3e922f0cc4fcf8c2e0460284cc462dae3b60a0/tensorflow_estimator-2.15.0-py2.py3-none-any.whl.metadata
  Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting keras<2.16,>=2.15.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for keras<2.16,>=2.15.0 from https://files.pythonhosted.org/packages/fc/a7/0d4490de967a67f68a538cc9cdb259bff971c4b5787f7765dc7c8f118f71/keras-2.15.0-py3-none-any.whl.metadata
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Requirement already satisfied: wheel<1.0,>=0.23.0 in c:\users\tanub\anaconda3\lib\site-packages (from astunparse>=1.6.0->tensorflow-intel==2.15.0->tensorflow) (0.38.4)
Collecting google-auth<3,>=1.6.3 (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for google-auth<3,>=1.6.3 from https://files.pythonhosted.org/packages/ca/7e/2d41727aeba37b84e1ca515fbb2ca0d706c591ca946236466ffe575b2059/google_auth-2.24.0-py2.py3-none-any.whl.metadata
  Downloading google_auth-2.24.0-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting google-auth-oauthlib<2,>=0.5 (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for google-auth-oauthlib<2,>=0.5 from https://files.pythonhosted.org/packages/ce/33/a907b4b67245647746dde8d61e1643ef5d210c88e090d491efd89eff9f95/google_auth_oauthlib-1.1.0-py2.py3-none-any.whl.metadata
  Downloading google_auth_oauthlib-1.1.0-py2.py3-none-any.whl.metadata (2.7 kB)
Requirement already satisfied: markdown>=2.6.8 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (3.4.1)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 from https://files.pythonhosted.org/packages/80/70/dc63d340d27b8ff22022d7dd14b8d6d68b479a003eacdc4507150a286d9a/protobuf-4.23.4-cp310-abi3-win_amd64.whl.metadata
  Downloading protobuf-4.23.4-cp310-abi3-win_amd64.whl.metadata (540 bytes)
Requirement already satisfied: requests<3,>=2.21.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.31.0)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for tensorboard-data-server<0.8.0,>=0.7.0 from https://files.pythonhosted.org/packages/7a/13/e503968fefabd4c6b2650af21e110aa8466fe21432cd7c43a84577a89438/tensorboard_data_server-0.7.2-py3-none-any.whl.metadata
  Downloading tensorboard_data_server-0.7.2-py3-none-any.whl.metadata (1.1 kB)
Requirement already satisfied: werkzeug>=1.0.1 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.2.3)
Collecting cachetools<6.0,>=2.0.0 (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for cachetools<6.0,>=2.0.0 from https://files.pythonhosted.org/packages/a2/91/2d843adb9fbd911e0da45fbf6f18ca89d07a087c3daa23e955584f90ebf4/cachetools-5.3.2-py3-none-any.whl.metadata
  Downloading cachetools-5.3.2-py3-none-any.whl.metadata (5.2 kB)
Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\tanub\anaconda3\lib\site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (0.2.8)
Collecting rsa<5,>=3.1.4 (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Downloading rsa-4.9-py3-none-any.whl (34 kB)
Collecting requests-oauthlib>=0.7.0 (from google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2023.11.17)
Requirement already satisfied: MarkupSafe>=2.1.1 in c:\users\tanub\anaconda3\lib\site-packages (from werkzeug>=1.0.1->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.1.1)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\users\tanub\anaconda3\lib\site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (0.4.8)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow)
  Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
     ---------------------------------------- 0.0/151.7 kB ? eta -:--:--
     ---------------------------------------- 151.7/151.7 kB ? eta 0:00:00
Downloading tensorflow-2.15.0-cp311-cp311-win_amd64.whl (2.1 kB)
Downloading tensorflow_intel-2.15.0-cp311-cp311-win_amd64.whl (300.9 MB)
   ---------------------------------------- 0.0/300.9 MB ? eta -:--:--
   ---------------------------------------- 1.1/300.9 MB 34.4 MB/s eta 0:00:09
   ---------------------------------------- 3.3/300.9 MB 42.8 MB/s eta 0:00:07
    --------------------------------------- 5.6/300.9 MB 44.9 MB/s eta 0:00:07
   - -------------------------------------- 8.3/300.9 MB 48.4 MB/s eta 0:00:07
   - -------------------------------------- 11.0/300.9 MB 50.1 MB/s eta 0:00:06
   - -------------------------------------- 13.7/300.9 MB 54.7 MB/s eta 0:00:06
   -- ------------------------------------- 15.8/300.9 MB 59.5 MB/s eta 0:00:05
   -- ------------------------------------- 18.2/300.9 MB 54.7 MB/s eta 0:00:06
   -- ------------------------------------- 18.4/300.9 MB 54.7 MB/s eta 0:00:06
   --- ------------------------------------ 22.6/300.9 MB 54.7 MB/s eta 0:00:06
   --- ------------------------------------ 22.6/300.9 MB 54.7 MB/s eta 0:00:06
   --- ------------------------------------ 22.8/300.9 MB 34.4 MB/s eta 0:00:09
   --- ------------------------------------ 24.0/300.9 MB 31.1 MB/s eta 0:00:09
   --- ------------------------------------ 26.3/300.9 MB 32.8 MB/s eta 0:00:09
   --- ------------------------------------ 29.3/300.9 MB 38.6 MB/s eta 0:00:08
   ---- ----------------------------------- 32.0/300.9 MB 34.4 MB/s eta 0:00:08
   ---- ----------------------------------- 34.7/300.9 MB 59.8 MB/s eta 0:00:05
   ---- ----------------------------------- 37.4/300.9 MB 59.5 MB/s eta 0:00:05
   ----- ---------------------------------- 40.2/300.9 MB 59.5 MB/s eta 0:00:05
   ----- ---------------------------------- 43.0/300.9 MB 54.4 MB/s eta 0:00:05
   ------ --------------------------------- 45.5/300.9 MB 59.5 MB/s eta 0:00:05
   ------ --------------------------------- 48.3/300.9 MB 59.5 MB/s eta 0:00:05
   ------ --------------------------------- 51.2/300.9 MB 59.5 MB/s eta 0:00:05
   ------- -------------------------------- 53.8/300.9 MB 59.5 MB/s eta 0:00:05
   ------- -------------------------------- 56.6/300.9 MB 59.8 MB/s eta 0:00:05
   ------- -------------------------------- 59.3/300.9 MB 59.5 MB/s eta 0:00:05
   -------- ------------------------------- 62.2/300.9 MB 59.5 MB/s eta 0:00:05
   -------- ------------------------------- 65.1/300.9 MB 59.5 MB/s eta 0:00:04
   -------- ------------------------------- 67.7/300.9 MB 59.5 MB/s eta 0:00:04
   --------- ------------------------------ 70.2/300.9 MB 59.5 MB/s eta 0:00:04
   --------- ------------------------------ 72.8/300.9 MB 59.5 MB/s eta 0:00:04
   --------- ------------------------------ 75.1/300.9 MB 54.4 MB/s eta 0:00:05
   ---------- ----------------------------- 77.4/300.9 MB 54.7 MB/s eta 0:00:05
   ---------- ----------------------------- 78.1/300.9 MB 54.7 MB/s eta 0:00:05
   ---------- ----------------------------- 81.1/300.9 MB 46.7 MB/s eta 0:00:05
   ---------- ----------------------------- 82.5/300.9 MB 43.7 MB/s eta 0:00:05
   ----------- ---------------------------- 84.8/300.9 MB 40.9 MB/s eta 0:00:06
   ----------- ---------------------------- 87.2/300.9 MB 43.5 MB/s eta 0:00:05
   ----------- ---------------------------- 89.9/300.9 MB 50.4 MB/s eta 0:00:05
   ------------ --------------------------- 92.2/300.9 MB 54.4 MB/s eta 0:00:04
   ------------ --------------------------- 94.6/300.9 MB 54.7 MB/s eta 0:00:04
   ------------ --------------------------- 97.0/300.9 MB 54.7 MB/s eta 0:00:04
   ------------- -------------------------- 99.0/300.9 MB 50.4 MB/s eta 0:00:05
   ------------- ------------------------- 101.8/300.9 MB 50.4 MB/s eta 0:00:04
   ------------- ------------------------- 103.9/300.9 MB 50.4 MB/s eta 0:00:04
   ------------- ------------------------- 106.7/300.9 MB 50.4 MB/s eta 0:00:04
   -------------- ------------------------ 109.0/300.9 MB 54.4 MB/s eta 0:00:04
   -------------- ------------------------ 111.6/300.9 MB 54.4 MB/s eta 0:00:04
   -------------- ------------------------ 114.0/300.9 MB 54.7 MB/s eta 0:00:04
   --------------- ----------------------- 116.7/300.9 MB 54.7 MB/s eta 0:00:04
   --------------- ----------------------- 119.0/300.9 MB 54.7 MB/s eta 0:00:04
   --------------- ----------------------- 121.7/300.9 MB 54.7 MB/s eta 0:00:04
   ---------------- ---------------------- 124.4/300.9 MB 54.4 MB/s eta 0:00:04
   ---------------- ---------------------- 126.8/300.9 MB 59.5 MB/s eta 0:00:03
   ---------------- ---------------------- 129.3/300.9 MB 59.5 MB/s eta 0:00:03
   ----------------- --------------------- 132.0/300.9 MB 54.4 MB/s eta 0:00:04
   ----------------- --------------------- 134.7/300.9 MB 54.7 MB/s eta 0:00:04
   ----------------- --------------------- 137.2/300.9 MB 54.7 MB/s eta 0:00:03
   ------------------ -------------------- 139.7/300.9 MB 59.5 MB/s eta 0:00:03
   ------------------ -------------------- 142.2/300.9 MB 54.7 MB/s eta 0:00:03
   ------------------ -------------------- 144.6/300.9 MB 50.4 MB/s eta 0:00:04
   ------------------ -------------------- 146.2/300.9 MB 50.4 MB/s eta 0:00:04
   ------------------- ------------------- 147.1/300.9 MB 40.9 MB/s eta 0:00:04
   ------------------- ------------------- 147.7/300.9 MB 43.5 MB/s eta 0:00:04
   ------------------- ------------------- 148.6/300.9 MB 34.4 MB/s eta 0:00:05
   ------------------- ------------------- 151.4/300.9 MB 36.4 MB/s eta 0:00:05
   ------------------- ------------------- 153.1/300.9 MB 32.8 MB/s eta 0:00:05
   -------------------- ------------------ 155.1/300.9 MB 31.2 MB/s eta 0:00:05
   -------------------- ------------------ 156.9/300.9 MB 36.4 MB/s eta 0:00:04
   -------------------- ------------------ 158.4/300.9 MB 38.5 MB/s eta 0:00:04
   -------------------- ------------------ 160.3/300.9 MB 38.5 MB/s eta 0:00:04
   -------------------- ------------------ 162.0/300.9 MB 40.9 MB/s eta 0:00:04
   --------------------- ----------------- 164.3/300.9 MB 40.9 MB/s eta 0:00:04
   --------------------- ----------------- 166.2/300.9 MB 40.9 MB/s eta 0:00:04
   --------------------- ----------------- 167.9/300.9 MB 40.9 MB/s eta 0:00:04
   --------------------- ----------------- 169.7/300.9 MB 40.9 MB/s eta 0:00:04
   ---------------------- ---------------- 171.9/300.9 MB 40.9 MB/s eta 0:00:04
   ---------------------- ---------------- 173.9/300.9 MB 43.7 MB/s eta 0:00:03
   ---------------------- ---------------- 175.7/300.9 MB 43.7 MB/s eta 0:00:03
   ----------------------- --------------- 177.6/300.9 MB 43.7 MB/s eta 0:00:03
   ----------------------- --------------- 179.6/300.9 MB 43.5 MB/s eta 0:00:03
   ----------------------- --------------- 181.6/300.9 MB 43.5 MB/s eta 0:00:03
   ----------------------- --------------- 183.7/300.9 MB 43.5 MB/s eta 0:00:03
   ------------------------ -------------- 185.8/300.9 MB 43.7 MB/s eta 0:00:03
   ------------------------ -------------- 187.7/300.9 MB 43.7 MB/s eta 0:00:03
   ------------------------ -------------- 189.7/300.9 MB 43.7 MB/s eta 0:00:03
   ------------------------ -------------- 191.6/300.9 MB 43.7 MB/s eta 0:00:03
   ------------------------ -------------- 192.8/300.9 MB 43.7 MB/s eta 0:00:03
   ------------------------- ------------- 194.4/300.9 MB 38.5 MB/s eta 0:00:03
   ------------------------- ------------- 196.3/300.9 MB 38.5 MB/s eta 0:00:03
   ------------------------- ------------- 198.4/300.9 MB 38.5 MB/s eta 0:00:03
   ------------------------- ------------- 200.2/300.9 MB 38.6 MB/s eta 0:00:03
   -------------------------- ------------ 202.6/300.9 MB 38.6 MB/s eta 0:00:03
   -------------------------- ------------ 204.6/300.9 MB 43.7 MB/s eta 0:00:03
   -------------------------- ------------ 206.6/300.9 MB 43.7 MB/s eta 0:00:03
   --------------------------- ----------- 208.6/300.9 MB 46.7 MB/s eta 0:00:02
   --------------------------- ----------- 210.5/300.9 MB 43.5 MB/s eta 0:00:03
   --------------------------- ----------- 212.7/300.9 MB 43.5 MB/s eta 0:00:03
   --------------------------- ----------- 214.8/300.9 MB 43.5 MB/s eta 0:00:02
   ---------------------------- ---------- 217.0/300.9 MB 43.7 MB/s eta 0:00:02
   ---------------------------- ---------- 219.1/300.9 MB 43.7 MB/s eta 0:00:02
   ---------------------------- ---------- 221.2/300.9 MB 46.7 MB/s eta 0:00:02
   ---------------------------- ---------- 222.8/300.9 MB 43.7 MB/s eta 0:00:02
   ----------------------------- --------- 225.2/300.9 MB 43.7 MB/s eta 0:00:02
   ----------------------------- --------- 227.4/300.9 MB 43.5 MB/s eta 0:00:02
   ----------------------------- --------- 229.5/300.9 MB 43.5 MB/s eta 0:00:02
   ------------------------------ -------- 231.6/300.9 MB 43.5 MB/s eta 0:00:02
   ------------------------------ -------- 233.7/300.9 MB 46.9 MB/s eta 0:00:02
   ------------------------------ -------- 236.2/300.9 MB 46.9 MB/s eta 0:00:02
   ------------------------------ -------- 238.3/300.9 MB 46.7 MB/s eta 0:00:02
   ------------------------------- ------- 240.2/300.9 MB 46.7 MB/s eta 0:00:02
   ------------------------------- ------- 242.3/300.9 MB 46.7 MB/s eta 0:00:02
   ------------------------------- ------- 244.5/300.9 MB 43.5 MB/s eta 0:00:02
   ------------------------------- ------- 246.8/300.9 MB 43.5 MB/s eta 0:00:02
   -------------------------------- ------ 249.0/300.9 MB 46.7 MB/s eta 0:00:02
   -------------------------------- ------ 251.1/300.9 MB 50.4 MB/s eta 0:00:01
   -------------------------------- ------ 253.2/300.9 MB 46.7 MB/s eta 0:00:02
   --------------------------------- ----- 255.3/300.9 MB 46.7 MB/s eta 0:00:01
   --------------------------------- ----- 257.4/300.9 MB 46.9 MB/s eta 0:00:01
   --------------------------------- ----- 259.6/300.9 MB 46.9 MB/s eta 0:00:01
   --------------------------------- ----- 261.8/300.9 MB 50.4 MB/s eta 0:00:01
   ---------------------------------- ---- 264.0/300.9 MB 46.7 MB/s eta 0:00:01
   ---------------------------------- ---- 265.9/300.9 MB 50.4 MB/s eta 0:00:01
   ---------------------------------- ---- 267.4/300.9 MB 43.5 MB/s eta 0:00:01
   ---------------------------------- ---- 269.8/300.9 MB 43.5 MB/s eta 0:00:01
   ----------------------------------- --- 271.4/300.9 MB 40.9 MB/s eta 0:00:01
   ----------------------------------- --- 273.1/300.9 MB 40.9 MB/s eta 0:00:01
   ----------------------------------- --- 275.4/300.9 MB 38.6 MB/s eta 0:00:01
   ------------------------------------ -- 277.8/300.9 MB 43.7 MB/s eta 0:00:01
   ------------------------------------ -- 279.4/300.9 MB 40.9 MB/s eta 0:00:01
   ------------------------------------ -- 281.5/300.9 MB 43.7 MB/s eta 0:00:01
   ------------------------------------ -- 283.9/300.9 MB 46.7 MB/s eta 0:00:01
   ------------------------------------- - 286.1/300.9 MB 46.7 MB/s eta 0:00:01
   ------------------------------------- - 288.4/300.9 MB 46.7 MB/s eta 0:00:01
   ------------------------------------- - 290.7/300.9 MB 50.4 MB/s eta 0:00:01
   ------------------------------------- - 292.4/300.9 MB 46.9 MB/s eta 0:00:01
   --------------------------------------  294.8/300.9 MB 46.7 MB/s eta 0:00:01
   --------------------------------------  296.8/300.9 MB 46.7 MB/s eta 0:00:01
   --------------------------------------  299.1/300.9 MB 46.7 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------  300.9/300.9 MB 43.5 MB/s eta 0:00:01
   --------------------------------------- 300.9/300.9 MB 14.9 MB/s eta 0:00:00
Downloading absl_py-2.0.0-py3-none-any.whl (130 kB)
   ---------------------------------------- 0.0/130.2 kB ? eta -:--:--
   ---------------------------------------- 130.2/130.2 kB 7.5 MB/s eta 0:00:00
Downloading flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Downloading grpcio-1.59.3-cp311-cp311-win_amd64.whl (3.7 MB)
   ---------------------------------------- 0.0/3.7 MB ? eta -:--:--
   ---------------------- ----------------- 2.1/3.7 MB 67.1 MB/s eta 0:00:01
   ---------------------------------------- 3.7/3.7 MB 46.9 MB/s eta 0:00:00
Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
   ---------------------------------------- 0.0/1.7 MB ? eta -:--:--
   ---------------------------------------- 1.7/1.7 MB 36.2 MB/s eta 0:00:00
Downloading libclang-16.0.6-py2.py3-none-win_amd64.whl (24.4 MB)
   ---------------------------------------- 0.0/24.4 MB ? eta -:--:--
   --- ------------------------------------ 2.0/24.4 MB 63.1 MB/s eta 0:00:01
   ------ --------------------------------- 4.0/24.4 MB 51.5 MB/s eta 0:00:01
   ---------- ----------------------------- 6.2/24.4 MB 49.2 MB/s eta 0:00:01
   ------------- -------------------------- 8.2/24.4 MB 47.8 MB/s eta 0:00:01
   ----------------- ---------------------- 10.6/24.4 MB 46.7 MB/s eta 0:00:01
   -------------------- ------------------- 12.4/24.4 MB 43.5 MB/s eta 0:00:01
   ------------------------ --------------- 14.7/24.4 MB 43.5 MB/s eta 0:00:01
   -------------------------- ------------- 16.4/24.4 MB 43.7 MB/s eta 0:00:01
   ------------------------------ --------- 18.6/24.4 MB 43.7 MB/s eta 0:00:01
   --------------------------------- ------ 20.6/24.4 MB 43.7 MB/s eta 0:00:01
   ------------------------------------- -- 22.8/24.4 MB 43.7 MB/s eta 0:00:01
   ---------------------------------------  24.1/24.4 MB 38.5 MB/s eta 0:00:01
   ---------------------------------------- 24.4/24.4 MB 34.4 MB/s eta 0:00:00
Downloading ml_dtypes-0.2.0-cp311-cp311-win_amd64.whl (938 kB)
   ---------------------------------------- 0.0/938.7 kB ? eta -:--:--
   --------------------------------------- 938.7/938.7 kB 30.0 MB/s eta 0:00:00
Downloading tensorboard-2.15.1-py3-none-any.whl (5.5 MB)
   ---------------------------------------- 0.0/5.5 MB ? eta -:--:--
   ---------------- ----------------------- 2.3/5.5 MB 49.6 MB/s eta 0:00:01
   ------------------------------- -------- 4.4/5.5 MB 56.6 MB/s eta 0:00:01
   ---------------------------------------- 5.5/5.5 MB 44.2 MB/s eta 0:00:00
Downloading protobuf-4.23.4-cp310-abi3-win_amd64.whl (422 kB)
   ---------------------------------------- 0.0/422.5 kB ? eta -:--:--
   --------------------------------------- 422.5/422.5 kB 27.5 MB/s eta 0:00:00
Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl (441 kB)
   ---------------------------------------- 0.0/442.0 kB ? eta -:--:--
   --------------------------------------- 442.0/442.0 kB 28.8 MB/s eta 0:00:00
Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Downloading google_auth-2.24.0-py2.py3-none-any.whl (183 kB)
   ---------------------------------------- 0.0/183.8 kB ? eta -:--:--
   ---------------------------------------- 183.8/183.8 kB ? eta 0:00:00
Downloading google_auth_oauthlib-1.1.0-py2.py3-none-any.whl (19 kB)
Downloading tensorboard_data_server-0.7.2-py3-none-any.whl (2.4 kB)
Downloading cachetools-5.3.2-py3-none-any.whl (9.3 kB)
Installing collected packages: libclang, flatbuffers, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, rsa, protobuf, opt-einsum, oauthlib, ml-dtypes, keras, grpcio, google-pasta, gast, cachetools, astunparse, absl-py, requests-oauthlib, google-auth, google-auth-oauthlib, tensorboard, tensorflow-intel, tensorflow
Successfully installed absl-py-2.0.0 astunparse-1.6.3 cachetools-5.3.2 flatbuffers-23.5.26 gast-0.5.4 google-auth-2.24.0 google-auth-oauthlib-1.1.0 google-pasta-0.2.0 grpcio-1.59.3 keras-2.15.0 libclang-16.0.6 ml-dtypes-0.2.0 oauthlib-3.2.2 opt-einsum-3.3.0 protobuf-4.23.4 requests-oauthlib-1.3.1 rsa-4.9 tensorboard-2.15.1 tensorboard-data-server-0.7.2 tensorflow-2.15.0 tensorflow-estimator-2.15.0 tensorflow-intel-2.15.0 tensorflow-io-gcs-filesystem-0.31.0 termcolor-2.4.0
Note: you may need to restart the kernel to use updated packages.
In [332]:
pip install --upgrade tensorflow
Requirement already satisfied: tensorflow in c:\users\tanub\anaconda3\lib\site-packages (2.15.0)
Requirement already satisfied: tensorflow-intel==2.15.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow) (2.15.0)
Requirement already satisfied: absl-py>=1.0.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (2.0.0)
Requirement already satisfied: astunparse>=1.6.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.6.3)
Requirement already satisfied: flatbuffers>=23.5.26 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (23.5.26)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (0.5.4)
Requirement already satisfied: google-pasta>=0.1.1 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (0.2.0)
Requirement already satisfied: h5py>=2.9.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (3.9.0)
Requirement already satisfied: libclang>=13.0.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (16.0.6)
Requirement already satisfied: ml-dtypes~=0.2.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (0.2.0)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.24.3)
Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (3.3.0)
Requirement already satisfied: packaging in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (23.1)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (4.23.4)
Requirement already satisfied: setuptools in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (68.0.0)
Requirement already satisfied: six>=1.12.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (2.4.0)
Requirement already satisfied: typing-extensions>=3.6.6 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (4.7.1)
Requirement already satisfied: wrapt<1.15,>=1.11.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.14.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (0.31.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (1.59.3)
Requirement already satisfied: tensorboard<2.16,>=2.15 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (2.15.1)
Requirement already satisfied: tensorflow-estimator<2.16,>=2.15.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorflow-intel==2.15.0->tensorflow) (2.15.0)
Collecting keras<2.16,>=2.15.0 (from tensorflow-intel==2.15.0->tensorflow)
  Obtaining dependency information for keras<2.16,>=2.15.0 from https://files.pythonhosted.org/packages/fc/a7/0d4490de967a67f68a538cc9cdb259bff971c4b5787f7765dc7c8f118f71/keras-2.15.0-py3-none-any.whl.metadata
  Using cached keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Requirement already satisfied: wheel<1.0,>=0.23.0 in c:\users\tanub\anaconda3\lib\site-packages (from astunparse>=1.6.0->tensorflow-intel==2.15.0->tensorflow) (0.38.4)
Requirement already satisfied: google-auth<3,>=1.6.3 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.24.0)
Requirement already satisfied: google-auth-oauthlib<2,>=0.5 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (1.1.0)
Requirement already satisfied: markdown>=2.6.8 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (3.4.1)
Requirement already satisfied: requests<3,>=2.21.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.31.0)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in c:\users\tanub\anaconda3\lib\site-packages (from tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.2.3)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in c:\users\tanub\anaconda3\lib\site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (5.3.2)
Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\tanub\anaconda3\lib\site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in c:\users\tanub\anaconda3\lib\site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\tanub\anaconda3\lib\site-packages (from google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (1.3.1)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\tanub\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2023.11.17)
Requirement already satisfied: MarkupSafe>=2.1.1 in c:\users\tanub\anaconda3\lib\site-packages (from werkzeug>=1.0.1->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (2.1.1)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\users\tanub\anaconda3\lib\site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in c:\users\tanub\anaconda3\lib\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow-intel==2.15.0->tensorflow) (3.2.2)
Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.0.0
    Uninstalling keras-3.0.0:
      Successfully uninstalled keras-3.0.0
Successfully installed keras-2.15.0
Note: you may need to restart the kernel to use updated packages.

Importing necessary packages¶

In [334]:
import torch
import tensorflow as tf
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.utils.data import DataLoader

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization

import copy
from datetime import datetime
import pickle
import time
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.optim import lr_scheduler

# Metrics
from sklearn.metrics import auc

Single Layer Neural Network¶

A Single Layer Neural Network, often referred to as a single-layer perceptron (SLP), is the simplest form of a neural network architecture. It consists of only one layer of artificial neurons, or perceptrons. This layer is the output layer, and it directly produces the final output without any hidden layers.

  1. Structure:

    • Input Layer: The input layer contains nodes (neurons) representing the features of the input data. Each node is connected to the output layer.
    • Output Layer: The output layer produces the final output. For binary classification, there is typically one node with a sigmoid activation function, while for multi-class classification, there might be multiple nodes with softmax activation.
  2. Activation Function:

    • Single Layer Neural Networks often use simple activation functions, such as the step function for binary classification or the sigmoid function for binary logistic regression. For multi-class classification, softmax activation is commonly used.
  3. Training:

    • Training a single-layer neural network involves adjusting the weights and biases based on the difference between the predicted output and the true output. This is typically done using a supervised learning algorithm, and common optimization techniques include gradient descent.
  4. Limitations:

    • Single Layer Neural Networks have limitations in their ability to learn complex patterns and relationships in data, especially when dealing with non-linearly separable problems. They are not capable of learning more intricate representations of data as there are no hidden layers for feature transformation.

image.png

In [335]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
cpu

training, test data preperation¶

full_X_train = data_prep_pipeline.fit_transform(X_train) full_X_test = data_prep_pipeline.fit_transform(X_test)

full_X_train_gpu = torch.FloatTensor(full_X_train) full_X_test_gpu = torch.FloatTensor(full_X_test)

y_train_gpu = torch.FloatTensor(y_train.to_numpy()) y_test_gpu = torch.FloatTensor(y_test.to_numpy())

In [364]:
full_X_test_gpu.shape,full_X_train_gpu.shape
Out[364]:
(torch.Size([1231, 138]), torch.Size([2439, 138]))
In [365]:
results = pd.DataFrame(columns=["ExpID", 
              "Train Acc", "Val Acc", "Test Acc", "p-value",
              "Train AUC", "Val AUC", "Test AUC",
              "Train f1", "Val f1", "Test f1",
              "Train logloss", "Val logloss", "Test logloss",
              "Train Time(s)", "Val Time(s)", "Test Time(s)", 
              "Experiment description",
              "Top 10 Features"])

Sigmoid layer for the probability of prediction¶

In neural networks, a sigmoid layer is commonly used at the output layer when the task involves binary classification or when the goal is to produce probabilities. The sigmoid function, also known as the logistic function, is employed to squash the network's output to a range between 0 and 1, representing the probability of belonging to the positive class.

The sigmoid function is defined as:

[ \sigma(x) = \frac{1}{1 + e^{-x}} ]

Here, ( x ) is the weighted sum of the inputs and biases. The sigmoid function maps this sum to a value between 0 and 1, which can be interpreted as the probability of the input belonging to the positive class.

In the context of binary classification, the output can be thresholded to make a final decision. For example, if the sigmoid output is greater than or equal to 0.5, the input is classified as belonging to the positive class; otherwise, it is classified as belonging to the negative class.

Mathematically, if ( p ) is the output of the sigmoid layer, the final binary prediction ( \hat{y} ) can be obtained as:

[ \hat{y} = \begin{cases} 1 & \text{if } p \geq 0.5 \\ 0 & \text{if } p < 0.5 \end{cases} ]

The sigmoid layer is especially useful for binary classification tasks, such as spam detection, fraud detection, or any problem where the goal is to predict one of two possible outcomes. It allows the neural network to output probabilities, which can be interpreted and used to make decisions based on a chosen threshold.

Keep in mind that for multi-class classification problems, a softmax layer is commonly used instead of a sigmoid layer. The softmax function generalizes the sigmoid function to multiple classes, providing a probability distribution over all possible classes.

image.png

Neural Network Architechture Srting code¶

"Input(100) - Hidden(20) - Sigmoid - Output(1)"

In [366]:
D_in = full_X_train_gpu.shape[1]
D_hidden1 = 20
D_hidden2 = 10
D_out= 1
model1 = torch.nn.Sequential( 
    torch.nn.Linear(D_in, D_out),
    nn.Sigmoid())
In [367]:
learning_rate = 0.01
optimizer = torch.optim.Adam(model1.parameters(), lr=learning_rate)
model1 = model1
In [368]:
def return_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())

  return_list = ([round(acc,4), round(roc_auc, 4)])

  return return_list
In [369]:
def print_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())

  print(f'Accuracy : {round(acc,4)} ; ROC_AUC : {round(roc_auc, 4)}')

Training Neural network with one layer¶

In [371]:
epochs = 500
y_train_gpu = y_train_gpu.reshape(-1, 1)
print('Train data : ')
model1.train()
for i in range(epochs):
  

  y_train_pred_prob = model1(full_X_train_gpu)

  loss = func.binary_cross_entropy(y_train_pred_prob, y_train_gpu)
  optimizer.zero_grad()
  #loss = loss_func(y_train_pred_prob, y_train_gpu)
  loss.backward()
  optimizer.step()

  if i % 50 == 49:
    print(f"Epoch {i + 1}:")
    print_report(y_train, y_train_pred_prob)
Train data : 
Epoch 50:
Accuracy : 0.9241 ; ROC_AUC : 0.8321
Epoch 100:
Accuracy : 0.9241 ; ROC_AUC : 0.8333
Epoch 150:
Accuracy : 0.9241 ; ROC_AUC : 0.8344
Epoch 200:
Accuracy : 0.9241 ; ROC_AUC : 0.8353
Epoch 250:
Accuracy : 0.9241 ; ROC_AUC : 0.8362
Epoch 300:
Accuracy : 0.9241 ; ROC_AUC : 0.8367
Epoch 350:
Accuracy : 0.9241 ; ROC_AUC : 0.8373
Epoch 400:
Accuracy : 0.9241 ; ROC_AUC : 0.8378
Epoch 450:
Accuracy : 0.9241 ; ROC_AUC : 0.8382
Epoch 500:
Accuracy : 0.9241 ; ROC_AUC : 0.8385

Model Evaluation¶

In [372]:
model1.eval()
y_test_gpu = y_test_gpu.reshape(-1, 1)
with torch.no_grad():
    y_test_pred_prob=model1(full_X_test_gpu)
    print('-' * 50)
    print('Test data : ')
    print_report(y_test, y_test_pred_prob)
    print('-' * 50)
--------------------------------------------------
Test data : 
Accuracy : 0.9236 ; ROC_AUC : 0.7463
--------------------------------------------------
In [375]:
X_kaggle_test=datasets['X_kaggle_test']
kaggle_test = X_kaggle_test[selected_features]
X_kaggle_test.shape,kaggle_test.shape
Out[375]:
((48744, 704), (48744, 132))

Prepare and submit to Kaggle¶

In [376]:
final_X_kaggle_test = kaggle_test
final_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(final_X_kaggle_test)
full_X_kaggle_gpu.shape
Out[376]:
torch.Size([48744, 138])
In [377]:
  model1.eval()
  test_class_scores = model1(full_X_kaggle_gpu)
  print(test_class_scores[0:10])
tensor([[0.0893],
        [0.2477],
        [0.0420],
        [0.0821],
        [0.1985],
        [0.0573],
        [0.0207],
        [0.0799],
        [0.0172],
        [0.0913]], grad_fn=<SliceBackward0>)
In [379]:
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "simple_nn"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'C:/Users/tanub/Courses/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/submission_{fs_type}.csv',index=False)
   SK_ID_CURR    TARGET
0      100001  0.089309
1      100005  0.247659

Neural Network model with multilayers and user defined CXE and hinge loss functions¶

image.png

The theoretical aspects of building a neural network model with multiple layers and using user-defined loss functions, specifically Cross-Entropy (CXE) and Hinge Loss.

Neural Network Architecture:¶

  1. Input Layer:

    • The input layer receives the features of your data.
  2. Hidden Layers:

    • Multiple hidden layers can be added to capture complex patterns in the data. Each hidden layer consists of neurons that apply activation functions to the weighted sum of inputs.
  3. Activation Functions:

    • Activation functions, such as ReLU (Rectified Linear Unit) for hidden layers, introduce non-linearity, allowing the network to learn complex mappings.
  4. Output Layer:

    • The output layer produces the final predictions. For binary classification, a single neuron with a sigmoid activation function is commonly used. For multi-class classification, a softmax activation function is often applied.

Custom Loss Functions:¶

1. Hinge Loss:¶

  • Definition:
    • The hinge loss is commonly used in Support Vector Machines (SVMs) and is suitable for binary classification tasks.
    • (L(y, f(x)) = \max(0, 1 - y \cdot f(x))), where (y) is the true label (-1 or 1), (f(x)) is the raw output of the model.
  • Usage:
    • Penalizes misclassifications and encourages the model to have a margin of at least 1 for correct classifications.

2. Cross-Entropy Loss:¶

  • Definition:
    • Cross-Entropy loss is frequently used in classification tasks.
    • For binary classification: (L(y, p) = -[y \cdot \log(p) + (1 - y) \cdot \log(1 - p)]).
    • For multi-class classification: (L(y, p) = -\sum_{i} y_i \cdot \log(p_i)), where (y_i) is the true distribution and (p_i) is the predicted distribution.
  • Usage:
    • Measures the dissimilarity between the true distribution and predicted distribution.

Training:¶

  • During training, the model's parameters (weights and biases) are adjusted using optimization algorithms (e.g., gradient descent) to minimize the chosen loss function.

one linear layer, one hidden layer with Relu function and sigmoid function for probability prediction¶

String Architecture "Input({input_features}) - Hidden1({80}Relu) - Hidden2({80}Sigmoid) - Output({1})"¶

In [380]:
## Model using hidden layers
class SVMNNmodel(nn.Module):
  def __init__(self, input_features , hidden1 = 80, hidden2 = 80, output_features = 1):
    super(SVMNNmodel, self).__init__()
    # self.f_connected1 = nn.Linear(input_features, hidden1)
    # self.f_connected2 = nn.Linear(hidden1, hidden2)
    # self.out = nn.Linear(hidden2, output_features)
    # self.sigmoid = nn.Sigmoid()
    self.f_connected1 = nn.Linear(input_features, hidden1)
    self.out = nn.Linear(hidden1, output_features)

  def forward(self, x):
    #x = func.relu(self.f_connected1(x))
    #x= func.relu(self.f_connected2(x))
    #x = self.out(x)
    #return self.sigmoid(x)
    h_relu = torch.relu(self.f_connected1(x))
    y_target_pred = torch.sigmoid(self.out(h_relu))
    return y_target_pred

The hinge loss is used to train models to make correct predictions while penalizing them more for being confidently wrong. This is particularly useful when dealing with non-linearly separable data or when there is noise in the dataset.

Here's a brief breakdown of our described model:

  1. One Linear Layer: This is the input layer, where the features of your data are fed into the model. The linear layer applies weights to the input features without introducing non-linearity.

  2. One Hidden Layer with ReLU Activation: The ReLU (Rectified Linear Unit) activation function is applied element-wise to the output of the linear layer. It introduces non-linearity to the model by outputting the input for all positive values and zero for all negative values. This allows the model to learn complex relationships in the data.

  3. Sigmoid Function for Probability Prediction: The output layer uses the sigmoid activation function. This function squashes the values between 0 and 1, making it suitable for binary classification problems where you want to output probabilities. In your case, it seems like you are using this to obtain the probability of belonging to the positive class.

  4. Hinge Loss Function: The hinge loss is a loss function used in SVMs and is effective for binary classification problems, especially when dealing with non-linearly separable data. It encourages the correct classification of data points while penalizing misclassifications, with a particular focus on instances that are close to the decision boundary.

To extend the hard SVM to handle noisy or non-linearly separable data, the hinge loss allows for a more flexible decision boundary. It penalizes misclassifications based on how far they are from the correct side of the decision boundary, providing robustness to noise and handling cases where a perfect separation is not possible.

image.png

In [381]:
class SVMLoss(nn.Module):
  def __init__(self):
    super(SVMLoss,self).__init__()
  def forward(self,outputs,labels,model2):
     C = 0.10
    # data_loss = torch.mean(torch.clamp(1 - outputs.squeeze().t() == labels,min=0))
     data_loss = torch.mean(torch.clamp(1 - outputs.squeeze(),min=0))
     weight = model2.out.weight.squeeze()
     reg_loss = weight.t() @ weight
     reg_loss = reg_loss + ( model2.out.bias.squeeze()**2)
     hinge = data_loss +( C*reg_loss/2)
     return (hinge)
In [382]:
class Converttensor(Dataset):
    def __init__(self, feature, label, mode ='train', transforms=None):
        """
        Initialize data set as a list of IDs corresponding to each item of data set

        :param feature: x - numpy array
        :param label: y - numpy array
        """

        self.x = feature
        self.y = label

    
    def __len__(self):
        """
        Return the length of data set using list of IDs

        :return: number of samples in data set
        """
        return (self.x.shape[0])

    def __getitem__(self, index):
        """
        Generate one item of data set.

        :param index: index of item in IDs list

        :return: image and label and bouding box params
        """
        x = self.x[index,:]
        y_target = self.y[index]

        x = torch.FloatTensor(x)
        y_target_arr = np.array(y_target)
        return x, y_target_arr
In [383]:
fprs_net_train, tprs_net_train, fprs_net_valid, tprs_net_valid = [], [], [], []
roc_auc_net_train = 0.0
roc_auc_net_valid = 0.0
num_epochs=25
batch_size=256
CASE_NAME = "NN"

Preparing Data for the MNN¶

In [385]:
splits = 1

# Train Test split percentage
subsample_rate = 0.3

finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
final_X_kaggle_test = kaggle_test
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)

nn_X_train = data_prep_pipeline.fit_transform(X_train)
nn_X_valid = data_prep_pipeline.fit_transform(X_valid)
nn_X_test = data_prep_pipeline.fit_transform(X_test)
nn_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(nn_X_kaggle_test)
nn_y_train = np.array(y_train)
nn_y_valid = np.array(y_valid)

in_feature_cnt = nn_X_train.shape[1]
out_feature_cnt = 1

print(f"X train           shape: {nn_X_train.shape}")
print(f"X validation      shape: {nn_X_valid.shape}")
print(f"X test            shape: {nn_X_test.shape}")
print(f"X kaggle_test     shape: {nn_X_kaggle_test.shape}")
print("Feature count           : ",in_feature_cnt)
X train           shape: (182968, 138)
X validation      shape: (32289, 138)
X test            shape: (92254, 138)
X kaggle_test     shape: (48744, 138)
Feature count           :  138
In [388]:
## Transform dataset
nn_dataset['train'] = Converttensor(nn_dataset['train'] ,nn_y_train, mode='train')
In [389]:
## Transform validation dataset
nn_dataset['val'] = Converttensor(nn_dataset['val'] ,nn_y_valid, mode='validation')
In [390]:
nn_dataset
Out[390]:
{'train': <__main__.Converttensor at 0x16cbce572d0>,
 'val': <__main__.Converttensor at 0x16cb9505f90>}
In [391]:
## Set dataloader
dataloaders = {x_type: torch.utils.data.DataLoader(nn_dataset[x_type], batch_size=batch_size,shuffle=True, num_workers=0)  
              for x_type in ['train', 'val']}  

Training the multi nueral network model¶

In [393]:
# Set model
nn_model = SVMNNmodel(input_features = in_feature_cnt, output_features= 1)
#nn_model = nn_model.float()
In [394]:
#del convergence
try:
       convergence
       epoch_offset=convergence.epoch.iloc[-1]+1
except NameError:
        convergence=pd.DataFrame(columns=['epoch','phase','roc_auc','accuracy','CXE','Hinge'])
        epoch_offset=0

This is a training loop for a neural network using custom loss functions (Cross-Entropy and Hinge Loss) and monitoring various performance metrics, such as accuracy and ROC AUC.

Custom Loss Functions:¶

  1. Hinge Loss:

    • The hinge loss is commonly used in Support Vector Machines (SVMs) and is suitable for binary classification tasks.
    • It is defined as (L(y, f(x)) = \max(0, 1 - y \cdot f(x))), where (y) is the true label (-1 or 1), and (f(x)) is the raw output of the model.
    • The loss penalizes misclassifications and encourages a margin of at least 1 for correct classifications.
  2. Cross-Entropy Loss:

    • Cross-Entropy loss is widely used in classification tasks.
    • For binary classification: (L(y, p) = -[y \cdot \log(p) + (1 - y) \cdot \log(1 - p)]).
    • For multi-class classification: (L(y, p) = -\sum_{i} y_i \cdot \log(p_i)), where (y_i) is the true distribution, and (p_i) is the predicted distribution.
    • The loss measures the dissimilarity between the true distribution and predicted distribution.

Training Loop:¶

  1. Data Loading:

    • The training loop iterates over batches of data from training and validation sets.
  2. Zeroing Gradients:

    • Gradients are zeroed before the backward pass to prevent accumulation from previous iterations.
  3. Forward Pass:

    • The model is set to training or evaluation mode based on the current phase.
    • The forward pass computes the output of the model for the given inputs.
  4. Loss Computation:

    • Both Hinge Loss and Cross-Entropy Loss are computed.
    • The weighted sum of the two losses is used as the final loss, with a weight ((w_{cel})) applied to the Cross-Entropy Loss.
  5. Backward Pass and Optimization:

    • Gradients are computed during the backward pass.
    • Optimization steps are performed separately for the Hinge Loss and Cross-Entropy Loss using different optimizers.
  6. Performance Metrics:

    • Accuracy, Hinge Loss, Cross-Entropy Loss, and ROC AUC are computed and printed for each phase (training or validation).
  7. Learning Rate Scheduling:

    • Learning rate schedulers (scheduler_cxe and scheduler_hinge) are used to adjust the learning rates during training.
  8. ROC AUC Calculation:

    • ROC AUC is calculated using the roc_curve and auc functions from scikit-learn.
  9. Visualization:

    • The ROC curve is plotted at the end of the final validation epoch.
  10. Best Model Tracking:

    • The best model weights are saved based on the highest validation accuracy.
In [409]:
def train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=21, w_cel=1.0):
    
    global roc_auc_train
    global roc_auc_valid

    fac_cel=torch.tensor(w_cel)

    start = time.time()

    best_model_wts = copy.deepcopy(nn_model.state_dict())
    best_acc = 0.0

    # Store results to easier collect stats
    nn_y_pred = {x: np.zeros((dataset_sizes[x],1)) for x in ['train', 'val']}

    for epoch in range(num_epochs):

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            t0=time.time()
            # Reset to zero to be save
           
            nn_y_pred[phase].fill(0)
            if phase == 'train':
                nn_model.train()  # Set model to training mode
            else:
                nn_model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0
            running_hinge = 0.0
            running_cxe = 0.0

            # Iterate over data.
            ix=0
            for inputs, targets in dataloaders[phase]:
                n_batch = len(targets)
                
                #nn_y_pred[phase][ix:ix+n_batch,:] = targets.detach().numpy().reshape(-1,1)

                inputs = inputs.to(device)
                targets = targets.to(device).float()

                # zero the parameter gradients
                optimizer_hinge.zero_grad()
                optimizer_cxe.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    output_target = nn_model.forward(inputs)
                    preds = torch.where((output_target > .5), 1, 0)
                    #print(output_target.squeeze(),targets)
                    ix += n_batch
                    loss_cxe = func.binary_cross_entropy(output_target.squeeze(), targets)
                    loss_hinge = criteron.forward(output_target.squeeze(), targets,nn_model)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss_hinge.backward()
                        optimizer_hinge.step()
                        #loss_cxe.backward()
                        optimizer_cxe.step()

                # statistics
                running_hinge += loss_hinge.item() * inputs.size(0)
                running_corrects += 1*(preds == targets.data.int()).sum().item()
                running_cxe += loss_cxe.item() * inputs.size(0)

            if phase == 'train':
                scheduler_hinge.step()
                scheduler_cxe.step()

            epoch_cxe = running_cxe / dataset_sizes[phase]
            epoch_hinge = running_hinge / dataset_sizes[phase]
            epoch_acc = running_corrects / dataset_sizes[phase]                      

            epoch_roc_auc = 0.0 
            if (phase == 'train'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of train
    
                nn_fprs_train, nn_tpr_train, nn_thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_train.append(nn_fprs_train)
                tprs_net_train.append(nn_tpr_train)
                roc_auc_train = round(auc(nn_fprs_train, nn_tpr_train), 4)  
                epoch_roc_auc = roc_auc_train

            elif (phase == 'val'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of valid
                nn_fpr_valid, nn_tpr_valid, thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_valid.append(nn_fpr_valid)
                tprs_net_valid.append(nn_tpr_valid)
                roc_auc_valid = round(auc(nn_fpr_valid, nn_tpr_valid), 4)
                epoch_roc_auc = roc_auc_valid

            dt=time.time() - t0
            fmt='{:6s} ROC_AUC: {:.4f} Acc: {:.4f} CXE: {:.4f} Hinge: {:.4f}  DT={:.1f}'
            out_list=[phase, epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge] + [dt]
            out_str=fmt.format(*out_list)
            if phase=='train':
                epoch_str='Epoch {}/{} '.format(epoch, num_epochs)
                out_str=epoch_str + out_str
            else:
                out_str = ' '*len(epoch_str) + out_str
            print(out_str)

            if (phase == 'val') and epoch == num_epochs-1:
                 plt.plot(nn_fprs_train, nn_tpr_train, color='blue') 
                 plt.plot(nn_fpr_valid, nn_tpr_valid, color='orange')
                 plt.xlim([0.0,1.0])
                 plt.ylim([0.0,1.0])
                 plt.xlabel('False Positive Rate')
                 plt.ylabel('True Positive Rate')
                 plt.title(f'ROC Curve Comparison')
                 plt.legend([f'TrainRocAuc (AUC = {roc_auc_train})', f'TestRocAuc (AUC = {roc_auc_valid})'])
                 plt.show()

            convergence.loc[len(convergence)] = [epoch+epoch_offset,phase,   
                        epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge]
            
            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(nn_model.state_dict())
 
    time_elapsed = time.time() - start
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    nn_model.load_state_dict(best_model_wts)   

Run the model¶

In [410]:
optimizer_cxe = optim.Adam(nn_model.parameters(), lr=0.0001)
optimizer_hinge = torch.optim.SGD(nn_model.parameters(), lr=learning_rate,momentum = 0.5,weight_decay = 0.1)
nn_model = nn_model
scheduler_cxe = lr_scheduler.StepLR(optimizer_cxe, step_size=10, gamma=0.1)  
scheduler_hinge= lr_scheduler.StepLR(optimizer_hinge, step_size=10, gamma=0.1)
criteron = SVMLoss()
train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=num_epochs, w_cel=0.000000001)

t0=time.time()
date_time = datetime.now().strftime("--%Y-%m-%d-%H-%M-%S-%f")
pickle.dump(nn_model,open(DATA_DIR + '/' + CASE_NAME + date_time + '.p','wb'))
print('Pickled in {:.2f} sec'.format(time.time()-t0))
Epoch 0/25 train  ROC_AUC: 0.7132 Acc: 20.6596 CXE: 3.4088 Hinge: 0.0695  DT=2.9
           val    ROC_AUC: 0.3222 Acc: 20.6486 CXE: 3.4057 Hinge: 0.0695  DT=0.4
Epoch 1/25 train  ROC_AUC: 0.6053 Acc: 20.6556 CXE: 3.4072 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.4062 Acc: 20.6624 CXE: 3.4058 Hinge: 0.0696  DT=0.4
Epoch 2/25 train  ROC_AUC: 0.4546 Acc: 20.6592 CXE: 3.4066 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.8750 Acc: 20.6624 CXE: 3.4082 Hinge: 0.0695  DT=0.4
Epoch 3/25 train  ROC_AUC: 0.5175 Acc: 20.6592 CXE: 3.4065 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.3871 Acc: 20.6555 CXE: 3.4077 Hinge: 0.0695  DT=0.4
Epoch 4/25 train  ROC_AUC: 0.4872 Acc: 20.6596 CXE: 3.4062 Hinge: 0.0696  DT=2.9
           val    ROC_AUC: 0.3750 Acc: 20.6624 CXE: 3.4069 Hinge: 0.0696  DT=0.4
Epoch 5/25 train  ROC_AUC: 0.5396 Acc: 20.6588 CXE: 3.4068 Hinge: 0.0696  DT=2.9
           val    ROC_AUC: 0.7000 Acc: 20.6486 CXE: 3.4031 Hinge: 0.0696  DT=0.4
Epoch 6/25 train  ROC_AUC: 0.4513 Acc: 20.6592 CXE: 3.4063 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.6173 Acc: 20.6279 CXE: 3.4105 Hinge: 0.0695  DT=0.4
Epoch 7/25 train  ROC_AUC: 0.7106 Acc: 20.6592 CXE: 3.4063 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.5444 Acc: 20.6486 CXE: 3.4074 Hinge: 0.0695  DT=0.4
Epoch 8/25 train  ROC_AUC: 0.5904 Acc: 20.6592 CXE: 3.4063 Hinge: 0.0696  DT=3.0
           val    ROC_AUC: 0.5667 Acc: 20.6486 CXE: 3.4125 Hinge: 0.0695  DT=0.4
Epoch 9/25 train  ROC_AUC: 0.5767 Acc: 20.6596 CXE: 3.4067 Hinge: 0.0696  DT=2.9
           val    ROC_AUC: nan Acc: 20.6693 CXE: 3.4089 Hinge: 0.0695  DT=0.4
Epoch 10/25 train  ROC_AUC: 0.6577 Acc: 20.6604 CXE: 3.4082 Hinge: 0.0696  DT=2.9
            val    ROC_AUC: 0.2889 Acc: 20.6486 CXE: 3.4056 Hinge: 0.0695  DT=0.4
Epoch 11/25 train  ROC_AUC: 0.5223 Acc: 20.6596 CXE: 3.4070 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 1.0000 Acc: 20.6555 CXE: 3.4049 Hinge: 0.0695  DT=0.4
Epoch 12/25 train  ROC_AUC: 0.4520 Acc: 20.6607 CXE: 3.4065 Hinge: 0.0696  DT=3.1
            val    ROC_AUC: 0.4111 Acc: 20.6486 CXE: 3.4051 Hinge: 0.0695  DT=0.4
Epoch 13/25 train  ROC_AUC: 0.4963 Acc: 20.6580 CXE: 3.4064 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.5111 Acc: 20.6486 CXE: 3.4060 Hinge: 0.0695  DT=0.4
Epoch 14/25 train  ROC_AUC: 0.6595 Acc: 20.6611 CXE: 3.4067 Hinge: 0.0696  DT=2.9
            val    ROC_AUC: 0.5948 Acc: 20.6417 CXE: 3.4064 Hinge: 0.0695  DT=0.4
Epoch 15/25 train  ROC_AUC: 0.5675 Acc: 20.6580 CXE: 3.4076 Hinge: 0.0696  DT=2.9
            val    ROC_AUC: 0.5323 Acc: 20.6555 CXE: 3.4048 Hinge: 0.0695  DT=0.4
Epoch 16/25 train  ROC_AUC: 0.5602 Acc: 20.6596 CXE: 3.4066 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.2929 Acc: 20.6348 CXE: 3.4054 Hinge: 0.0695  DT=0.4
Epoch 17/25 train  ROC_AUC: 0.4518 Acc: 20.6584 CXE: 3.4069 Hinge: 0.0696  DT=3.1
            val    ROC_AUC: 0.7672 Acc: 20.6417 CXE: 3.4050 Hinge: 0.0695  DT=0.4
Epoch 18/25 train  ROC_AUC: 0.4485 Acc: 20.6596 CXE: 3.4066 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.4569 Acc: 20.6417 CXE: 3.4051 Hinge: 0.0696  DT=0.4
Epoch 19/25 train  ROC_AUC: 0.6765 Acc: 20.6600 CXE: 3.4063 Hinge: 0.0696  DT=2.9
            val    ROC_AUC: 0.4397 Acc: 20.6417 CXE: 3.4059 Hinge: 0.0695  DT=0.4
Epoch 20/25 train  ROC_AUC: 0.5585 Acc: 20.6580 CXE: 3.4075 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.8438 Acc: 20.6624 CXE: 3.4057 Hinge: 0.0695  DT=0.4
Epoch 21/25 train  ROC_AUC: 0.4819 Acc: 20.6588 CXE: 3.4073 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.4483 Acc: 20.6417 CXE: 3.4056 Hinge: 0.0695  DT=0.4
Epoch 22/25 train  ROC_AUC: 0.4187 Acc: 20.6584 CXE: 3.4074 Hinge: 0.0696  DT=3.1
            val    ROC_AUC: 0.7111 Acc: 20.6486 CXE: 3.4055 Hinge: 0.0695  DT=0.4
Epoch 23/25 train  ROC_AUC: 0.5178 Acc: 20.6623 CXE: 3.4072 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.6034 Acc: 20.6417 CXE: 3.4055 Hinge: 0.0695  DT=0.4
Epoch 24/25 train  ROC_AUC: 0.5360 Acc: 20.6611 CXE: 3.4072 Hinge: 0.0696  DT=3.0
            val    ROC_AUC: 0.4889 Acc: 20.6486 CXE: 3.4054 Hinge: 0.0695  DT=0.4
Training complete in 1m 24s
Best val Acc: 20.669330
Pickled in 0.00 sec

Interpreting results¶

Based on the information provided, the ROC curve depicts the performance of a home credit default risk classifier that employs multilayer neural networks with hinge and cross-entropy loss functions. The classifier's ability to discriminate between borrowers who will repay their loans (true positives) and those who will default (false positives) is evaluated using the ROC curve.

The ROC curve indicates that the classifier achieves a true positive rate (TPR) of 0.8 and a false positive rate (FPR) of 0.2 at the point where the curve intersects the 0.5 line. This implies that the classifier accurately identifies 80% of borrowers who will default while erroneously identifying 20% of borrowers who will repay.

The Area Under the Curve (AUC) of the ROC curve, which measures the overall performance of the classifier, is 0.536. A higher AUC indicates better performance, and the AUC of 0.536 suggests moderate performance.

In essence, the ROC curve demonstrates that the classifier exhibits moderate capability in differentiating between borrowers who will default and those who will repay. However, there is still room for improvement.

To illustrate the ROC curve's interpretation in the context of home credit default risk assessment, consider the following:

  • The TPR of 0.8 implies that the classifier accurately identifies 80% of borrowers who will default.

  • The FPR of 0.2 indicates that 20% of borrowers who will repay their loans are erroneously identified as defaulters.

This suggests that the classifier effectively identifies defaulters but is also prone to false positives, potentially leading to the rejection of creditworthy borrowers.

The decision to utilize this classifier would depend on the specific context. For instance, if the consequences of default are severe, a higher FPR might be acceptable to prevent missing defaulters. However, if the consequences are less severe or if false positives incur substantial costs, a classifier with a lower FPR might be preferable.

In conclusion, the ROC curve provides valuable insights into the performance of the home credit default risk classifier, indicating its moderate ability to differentiate between defaulters and non-defaulters. Further improvements could enhance its accuracy and reduce false positives, leading to more informed lending decisions.

In [411]:
convergence.head(5)
Out[411]:
epoch phase roc_auc accuracy CXE Hinge
0 0 train 0.4396 22.661941 2.314936 0.151300
1 0 val 0.1875 20.662424 2.802737 0.099435
2 1 train 0.3841 20.661930 2.943906 0.090852
3 1 val 0.8710 20.655517 3.050676 0.084629
4 2 train 0.5770 20.661143 3.103558 0.081739

Submission File Prep¶

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.

Kaggle submission file preperation¶

In [413]:
nn_model.eval()
test_class_scores = nn_model(full_X_kaggle_gpu)
print(test_class_scores[0:10])
tensor([[0.9934],
        [0.9860],
        [0.9717],
        [0.9450],
        [0.9744],
        [0.9436],
        [0.9820],
        [0.9635],
        [0.9494],
        [0.9996]], grad_fn=<SliceBackward0>)
In [414]:
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "Multilayer_nn1"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'C:/Users/tanub/Courses/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/submission_{fs_type}.csv',index=False)
   SK_ID_CURR    TARGET
0      100001  0.993352
1      100005  0.985980

report submission¶

Click on this link

image-3.png

Write-up¶

Abstract for Phase 4¶

Abstract for Phase 4: Deep Learning Model Development and Kaggle Submission

In Phase 4, we focus on data preparation, constructing a foundational single-layer neural network, and advancing to a deep neural network for enhanced predictive capabilities. The choice between Hinge and Cross-Entropy Loss functions is carefully considered, aligning with the dataset characteristics. Our model is meticulously built, incorporating activation functions and regularization techniques, followed by comprehensive training. The culmination involves submitting predictions to Kaggle, evaluating the model's performance, and fine-tuning for optimal results. This phase signifies a pivotal transition to deep learning methodologies, showcasing our model's practical utility in predicting credit risk. Through Kaggle, we aim to contribute valuable insights to the data science community and benchmark our model against industry standards.

In Phase 1,2,3, our approach encompasses a robust feature engineering initiative, extending beyond conventional features to introduce novel variables and optimize existing ones. This phase involves the deployment of multiple experimental models, incorporating both original and engineered features to comprehensively evaluate their performance. Subsequent hyperparameter tuning seeks to fine-tune model configurations, aiming for optimal predictive accuracy. The culmination of Phase 3 involves preparing and submitting our model predictions to Kaggle, aligning with a holistic strategy to refine, enhance, and competitively position our models in the evolving landscape of the competition.

Introduction¶

Home Credit, an international non-bank financial institution, prioritizes providing loans to individuals regardless of their credit history, aiming to offer a positive borrowing experience to those not served by traditional sources. To address unfair loan rejection, Home Credit Group released a Kaggle dataset. The project objective is to construct a machine learning model predicting customer loan repayment behavior. We will create a pipeline for a baseline logistic regression classification model, evaluating its performance with metrics like Confusion Matrix, Accuracy Score, Precision, Recall, F1 Score, and AUC. The refined model aims to identify default risk, ensuring deserving clients are approved with suitable terms, empowering them for success. The best-performing pipeline will be submitted to the HCDR Kaggle Competition.

image.png

Feature Engineering and transformers¶

Our feature engineering endeavors encompassed several key aspects, delineated as follows:

Incorporating Domain-Specific Insights: The integration of custom domain knowledge played a pivotal role in the formulation of unique features tailored to our dataset.

Crafting Engineered Aggregated Features: A deliberate effort was made to create novel aggregated features through meticulous engineering, enhancing the dataset's overall representational capacity.

Exploratory Modeling of the Data: We delved into experimental modeling techniques, aiming to uncover hidden patterns and relationships within the dataset that might have eluded conventional analysis.

Validation of Manual One-Hot Encoding (OHE): Rigorous validation processes were applied to ensure the accuracy and effectiveness of manually applied One-Hot Encoding, a critical step in categorical data representation.

Polynomial Feature Expansion (Degree 4): A sophisticated approach involved the generation of polynomial features up to the fourth degree for select variables, amplifying the complexity and richness of the feature set.

Comprehensive Dataset Merging: All relevant datasets were systematically merged, fostering a holistic view of the data and promoting comprehensive analyses.

Pruning Columns with Missing Values: To enhance the dataset's integrity, columns with missing values were judiciously identified and subsequently removed, streamlining the dataset for further analysis.

A pivotal step in the feature engineering process involves the integration of domain knowledge-based features, a critical factor in enhancing model accuracy. Initially, we undertook the task of identifying these features for each dataset. Among the novel custom features introduced were metrics such as post-payment credit card balance relative to the due amount, average application amount, credit average, available credit as a percentage of income, annuity as a percentage of income, and annuity as a percentage of available credit.

Subsequently, we delved into numerical feature identification and aggregation, employing mean, minimum, and maximum values. Although an attempt was made to implement label encoding for unique values exceeding 5 during the engineering phase, a strategic decision led to the application of One-Hot Encoding (OHE) at the pipeline level. This targeted specific highly correlated fields in the final merged dataset, optimizing code management.

Extensive feature engineering was executed through multiple modeling approaches, involving primary, secondary, and tertiary tables, culminating in an optimized approach with minimal memory usage. The first attempt focused on creating engineered and aggregated features for Key-Level 3 tables, merging them with Key-Level 2 tables, and ultimately combining them with the primary dataset. However, this approach resulted in a surplus of redundant features, consuming significant memory.

In Attempt 2, a streamlined approach was adopted, creating custom and aggregated features for Key-Level 3 tables, merging them with Key-Level 2 tables based on the primary key, and extending this to Key-Level 1 tables using additional aggregated columns. This approach reduced duplicates, optimized memory usage, and employed a garbage collector after each merge.

In Attempt 3, the merged dataframe from the previous attempt was further enriched with polynomial features of degree 4. A final merge of Key-Level 3, Key-Level 2, and Key-Level 1 datasets formed the training dataframe, with meticulous attention to ensuring that no columns had more than 50% missing data.

The process of engineering and incorporating these features into the model, coupled with judicious splits during testing, initially yielded lower accuracy. However, deploying these merged features with well-considered splits during the training phase resulted in improved accuracy and diminished risk of overfitting, especially notable in models like Random Forest and XGBoost.

Future endeavors include implementing label encoding for all unique categorical values, exploring techniques such as PCA or custom functions to address multicollinearity in the pipeline, eliminating low-importance features, and evaluating their impact on model accuracy.

Pipelines¶

The logistic regression model serves as our foundational approach due to its ease of implementation and high efficiency, requiring modest computational resources. We fine-tuned essential hyperparameters, including regularization, tolerance, and C, for the logistic regression model, assessing the outcomes against the baseline. Employing 4-fold cross-validation, we leveraged hyperparameter tuning through the Sklearn GridSearchCV function to optimize model performance.

For the Decision Tree model, we adopted a foundational approach leveraging its interpretability and simplicity. Through exhaustive grid search using Sklearn's GridSearchCV, key hyperparameters such as maximum depth, minimum samples split, and minimum samples leaf were fine-tuned. Utilizing 4-fold cross-validation, we systematically optimized the Decision Tree's configuration, evaluating performance enhancements against the baseline.

The Random Forest model was chosen for its robustness and ensemble capabilities. Employing GridSearchCV, we fine-tuned crucial hyperparameters like the number of estimators, maximum depth, and minimum samples split. With 4-fold cross-validation, we iteratively optimized the Random Forest's settings, comparing outcomes to the baseline for improved predictive accuracy.

In the case of the XGBClassifier, a powerful gradient boosting algorithm, we conducted meticulous hyperparameter tuning using GridSearchCV. Parameters such as learning rate, maximum depth, and subsample were optimized to enhance the model's performance. Employing 4-fold cross-validation, we systematically refined the XGBClassifier's configuration, aiming for superior predictive capabilities.

For Bagging, a versatile ensemble method, we harnessed its ability to reduce overfitting and enhance stability. Through GridSearchCV, we fine-tuned parameters such as the number of base estimators and maximum samples. Using 4-fold cross-validation, we strategically optimized Bagging's hyperparameters, gauging improvements in model performance relative to the baseline.

Each model underwent rigorous tuning, balancing computational efficiency and performance gains, with cross-validation ensuring robustness in the optimization process.

image-2.png

  1. Achieve class balance in the "Default" target by resampling and implement cross-fold validation for data splitting.

  2. Develop a data pipeline encompassing 277 feature, selected through aggregation and feature engineering.

  3. Address missing numerical attributes by imputing mean values and handle missing categorical values with the most frequent values.

  4. Utilize FeatureUnion to seamlessly combine both numerical and categorical features within the pipeline.

  5. Construct a model incorporating the data pipeline and a baseline model, assessing performance on both balanced and imbalanced training datasets.

  6. Evaluate the model using accuracy score, F1 score, log loss, and AUC score for training, validation, and test sets, and record results in a dataframe.


Neural Networks for Home Credit Default Risk Prediction¶

This project explores the application of neural networks in predicting credit default risk for home loans. Home Credit Default Risk (HCDR) is a critical concern for financial institutions, and accurate prediction models can aid in making informed lending decisions. Traditional credit scoring models often fall short in capturing complex patterns within diverse datasets.

In this study, we leverage the power of neural networks, specifically deep learning architectures, to enhance the accuracy of credit risk assessment. We employ a dataset from Home Credit, consisting of various socio-economic and financial features. The neural network model is designed to automatically learn intricate relationships and dependencies within the data, allowing for more robust risk predictions.

image-15.png

The project includes the following key components:

  1. Data Preprocessing: Cleaning and feature engineering to prepare the dataset for neural network training.

  2. Neural Network Architecture: Designing a deep learning model tailored for credit risk prediction, with appropriate layers, activation functions, and optimization algorithms.

  3. Training and Validation: Utilizing historical data to train the neural network and validating its performance on a separate dataset to ensure generalization.

  4. Evaluation Metrics: Employing standard metrics such as accuracy, precision, recall, and the area under the ROC curve to assess the model's effectiveness.

  5. Interpretability: Exploring methods to interpret the neural network's decisions, providing insights into the factors contributing to credit default risk.

The outcomes of this project aim to contribute to the development of more sophisticated and accurate credit risk models, potentially improving the decision-making processes for financial institutions in the context of home lending.


Hyperparameter Tuning¶

we perform hyperparameter tuning for an XGBClassifier using GridSearchCV. The dataset is first balanced using SMOTE to address class imbalance. Subsequently, the training data is randomly sampled to expedite the grid search process, selecting 50% of the balanced data. We focus on tuning essential parameters for the XGBClassifier, including the number of estimators and learning rate.

The XGBClassifier is instantiated with a binary logistic objective function, and the hyperparameter grid consists of varying values for the number of estimators (300, 400) and learning rates (0.1, 0.05). The grid search is executed with 3-fold cross-validation, optimizing for recall as the scoring metric. The process is parallelized with three jobs for efficiency.

After fitting the grid search to the training data, the best estimator and corresponding recall score are printed. This approach ensures that the XGBClassifier is fine-tuned for optimal performance on the given classification task.

  1. Best Estimator:
    • Model: XGBClassifier
    • Hyperparameters:
      • learning_rate: 0.1
      • n_estimators: 400

Importance:

Class Imbalance Handling: SMOTE is employed to address the class imbalance issue, generating synthetic samples for the minority class and ensuring a more balanced dataset.

Computational Efficiency: To expedite the hyperparameter tuning process, a randomly sampled subset of the balanced data is used, optimizing computational resources without compromising the quality of the tuning.

Objective Function and Parameters: The XGBClassifier is configured with a binary logistic objective function, suitable for binary classification tasks. The hyperparameter grid is strategically chosen, focusing on critical parameters such as the number of estimators and learning rate.

GridSearchCV: Sklearn's GridSearchCV efficiently explores a range of hyperparameter combinations, selecting the optimal configuration based on the specified scoring metric (recall in this case). The 3-fold cross-validation ensures robust evaluation.

Parallelization: The grid search process is parallelized with three jobs (n_jobs=3), leveraging available computational resources for faster parameter optimization.

Results Interpretation: The best estimator and its corresponding recall score are printed, providing insights into the configuration that maximizes the model's ability to capture positive instances. This informs further model refinement and enhances predictive performance.

Experimental results¶

The XGBClassifier model exhibits excellent performance on the classification task, achieving a high F1 score of 0.95, indicating a robust balance between precision and recall. The model demonstrates strong predictive accuracy with 96%, effectively identifying positive and negative cases. Notably, the recall stands at an impressive 92%, highlighting the model's proficiency in capturing actual positive instances. Additionally, the ROC AUC score of 49.80% reinforces the model's discriminative capability. With a minimal false positive rate (0.14%) and false negative rate (4.25%), the XGBClassifier showcases superior predictive accuracy and reliability in both positive and negative predictions.

image-11.png

Neural Networks¶

Based on the information provided, the ROC curve depicts the performance of a home credit default risk classifier that employs multilayer neural networks with hinge and cross-entropy loss functions. The classifier's ability to discriminate between borrowers who will repay their loans (true positives) and those who will default (false positives) is evaluated using the ROC curve.

The ROC curve indicates that the classifier achieves a true positive rate (TPR) of 0.8 and a false positive rate (FPR) of 0.2 at the point where the curve intersects the 0.5 line. This implies that the classifier accurately identifies 80% of borrowers who will default while erroneously identifying 20% of borrowers who will repay.

The Area Under the Curve (AUC) of the ROC curve, which measures the overall performance of the classifier, is 0.536. A higher AUC indicates better performance, and the AUC of 0.536 suggests moderate performance.

In essence, the ROC curve demonstrates that the classifier exhibits moderate capability in differentiating between borrowers who will default and those who will repay. However, there is still room for improvement.

This suggests that the classifier effectively identifies defaulters but is also prone to false positives, potentially leading to the rejection of creditworthy borrowers.

The decision to utilize this classifier would depend on the specific context. For instance, if the consequences of default are severe, a higher FPR might be acceptable to prevent missing defaulters. However, if the consequences are less severe or if false positives incur substantial costs, a classifier with a lower FPR might be preferable.

In conclusion, the ROC curve provides valuable insights into the performance of the home credit default risk classifier, indicating its moderate ability to differentiate between defaulters and non-defaulters. Further improvements could enhance its accuracy and reduce false positives, leading to more informed lending decisions. image-18.png

What Worked Well:

  1. High F1 Score (0.95): The XGBClassifier model demonstrated excellent performance with a high F1 score of 0.95. This indicates a robust balance between precision and recall, showcasing the model's ability to effectively identify positive and negative cases.

  2. Strong Predictive Accuracy (96%): The model achieved an impressive overall accuracy of 96%, indicating its effectiveness in making correct predictions across both positive and negative instances.

  3. High Recall (92%): The recall rate of 92% is particularly noteworthy, highlighting the model's proficiency in capturing actual positive instances. This is crucial, especially in scenarios where correctly identifying positive cases is of utmost importance.

  4. Low False Positive and False Negative Rates: The minimal false positive rate (0.14%) and false negative rate (4.25%) suggest that the model exhibits superior predictive accuracy and reliability in both positive and negative predictions.

  5. Discriminative Capability (ROC AUC of 49.80%): Despite the ROC AUC score being 49.80%, the model still demonstrates discriminative capability. The ROC AUC score, while not exceptionally high, indicates that the model is effective at distinguishing between classes.

What Surprisingly Did Not Work Well:

  1. ROC AUC for Training Set (0.536): The ROC AUC score for the training set is 0.536, suggesting only a moderate ability of the model to discriminate between defaulters and non-defaulters. This may indicate that the model's performance on the training set does not fully generalize to unseen data.

  2. Low Accuracy on Training Set (20.66%): The accuracy on the training set is surprisingly low at 20.66%. This implies that the model's predictions align with the true labels for only approximately one-fifth of the training set. This could be an indication of overfitting or issues with generalization.

  3. Cross-Entropy Loss (CXE) of 3.4072: The Cross-Entropy Loss (CXE) of 3.4072 reflects the average difference between predicted and actual probabilities. While the specific context of your problem domain may influence the interpretation, a high CXE could suggest that the model's predicted probabilities diverge significantly from the actual probabilities, indicating room for improvement in calibration.

In summary, the model performs exceptionally well in terms of F1 score, overall accuracy, recall, and low false positive/negative rates. However, there are concerns related to its discriminative capability on the training set, low accuracy on the training set, and the Cross-Entropy Loss, suggesting potential areas for further investigation and model refinement.

Gap Analysis of Best Pipeline Against Other Submissions:

Our best-performing pipeline utilizes XGBoost with an achieved score of 0.738. Let's compare this against other submissions:

  1. Logistic Regression (0.764 Kaggle Submission Score):

    • Gap Analysis: The Logistic Regression model from other submissions outperformed our XGBoost model with a Kaggle submission score of 0.764.
    • Potential Reasons for Gap:
      • Logistic Regression might be well-suited for this specific dataset, capturing linear relationships effectively.
      • Feature engineering or data preprocessing steps in other submissions could be more tailored to the characteristics of the dataset.
  2. Neural Network (0.74961):

    • Gap Analysis: The Neural Network model from other submissions achieved a score of 0.74961, surpassing our XGBoost model.
    • Potential Reasons for Gap:
      • Neural Networks are capable of capturing complex non-linear relationships that may be present in the data.
      • Hyperparameter tuning or architecture choices in the Neural Network from other submissions may have been more effective.

Analysis of Our Best Pipeline:

  • Feature Preprocessing:

    • Categorical Pipeline: Our categorical pipeline involves imputing missing values with the most frequent value and applying one-hot encoding. This is a common preprocessing approach for categorical features.
    • Numerical Pipeline: For numerical features, missing values are imputed with the mean, and standard scaling is applied. Again, this is a standard preprocessing method.
    • Feature Union: The two pipelines (categorical and numerical) are combined using FeatureUnion.
  • Model Choice:

    • XGBoost: We chose XGBoost as the main model for this pipeline. XGBoost is a powerful gradient boosting algorithm known for its efficiency and performance.

Discussion of results¶

This table is a summary of the performance of an XGBClassifier model on a classification task. The model was tuned to improve its performance. The following are some interpretations of the table

Recall: This metric measures the proportion of actual positive cases that are correctly predicted by the model. In this case, 92% of the actual positive cases were correctly predicted by the model.

F1: This metric is a harmonic mean of precision and recall, and it is often used to evaluate the performance of classification models. In this case, the F1 score of the model is 0.95, which is very good.

Accuracy: This metric measures the proportion of all predictions that are correct. In this case, 96% of all predictions made by the model were correct.

ROC AUC Score: This metric measures the area under the receiver operating characteristic (ROC) curve, which is a plot of the model's true positive rate versus its false positive rate. In this case, the ROC AUC score of the model is 49.80%, which is very good.

True Negative: This metric counts the number of cases where the model correctly predicted that a case was negative. In this case, the model correctly predicted that 99.86% of the cases were negative. False Positive: This metric counts the number of cases where the model incorrectly predicted that a case was positive. In this case, the model incorrectly predicted that 0.14% of the cases were positive. False Negative: This metric counts the number of cases where the model incorrectly predicted that a case was negative. In this case, the model incorrectly predicted that 4.25% of the cases were negative. True Positive: This metric counts the number of cases where the model correctly predicted that a case was positive. In this case, the model correctly predicted that 45.81% of the cases were positive.

Overall, the XGBClassifier model is performing very well on this classification task. It has a high precision, recall, F1 score, accuracy, and ROC AUC score. It also has a low false positive rate and a low false negative rate. This means that the model is very good at both predicting positive cases and predicting negative cases.

image-13.png Tuned Experiment image-14.png

neural Networks with Multilayered Approach¶

The results from Epoch 24/25 in training and validation phases for the Home Credit Default Risk (HCDR) using a multi-layer neural network with cross-entropy (CXE) and hinge loss functions provide insights:

Performance Metrics:

The ROC_AUC for the training set is 0.536, suggesting a moderate ability of the model to discriminate between defaulters and non-defaulters. The accuracy is 20.66%, indicating that the model's predictions align with the true labels for approximately one-fifth of the training set. Cross-Entropy Loss (CXE) is 3.4072, reflecting the average difference between predicted and actual probabilities. Hinge Loss Insights:

The Hinge Loss for the training set is 0.0696, emphasizing the model's focus on correctly classifying instances near the decision boundary. The low Hinge Loss suggests that the model is penalizing misclassifications with a margin less than 1. Validation Performance:

The ROC_AUC for the validation set is 0.4889, indicating a similar discriminative ability but potentially lower generalization compared to the training set. The accuracy on the validation set is 20.65%, consistent with the training set but highlighting potential limitations in predictive power. The close values between training and validation metrics suggest that the model is not overfitting the training data. Decision Threshold:

The Decision Threshold (DT) values for training and validation are 3.0 and 0.4, respectively. This implies a difference in the threshold for predicting positive outcomes, affecting the balance between true positives and false positives. Areas for Improvement:

The relatively low ROC_AUC and accuracy values suggest that the model might benefit from further refinement or feature engineering to capture more complex patterns in the data. The decision threshold discrepancy between training and validation could be explored to optimize the trade-off between sensitivity and specificity. In summary, while the multi-layer neural network shows a reasonable ability to identify default risk, there is room for improvement in terms of discriminative power and generalization. Further analysis and potential model adjustments may enhance predictive performance for Home Credit Default Risk assessment.

image-17.png

Conclusion¶

The results table presents a comprehensive overview of the performance of various machine learning models on an HCDR classification task, with a focus on metrics such as accuracy, AUC, F1 score, and loss. The findings highlight several key points:

  1. Top Performing Models:

    • BaggingClassifier_with_advanced_features, Oversampled_RandomForest_with_advanced_features, and XgBoost with best Hyperparameters stand out as the top performers, achieving high accuracy, AUC, and F1 score on the test set.
    • These models also exhibit low loss, indicating robust performance across multiple evaluation metrics.
  2. Feature Importance:

    • Models with advanced features consistently outperform those with basic features, emphasizing the importance of feature engineering in enhancing model performance.
  3. Oversampling Effectiveness:

    • The oversampling technique proves effective, particularly in improving the performance of logistic regression and decision tree models, as evidenced by their enhanced accuracy, AUC, and F1 score.
  4. Ensemble Learning Advantage:

    • Ensemble learning models, such as bagging, random forest, and boosting, demonstrate superior performance over individual models, reinforcing the effectiveness of combining multiple models for better results.
  5. XgBoost Model Dominance:

    • The XgBoost model with optimized hyperparameters emerges as the overall best-performing model, showcasing high accuracy, AUC, and F1 score.

Regarding the XGBClassifier model specifically:

  • Recall: The model correctly predicts 92% of actual positive cases, indicating strong sensitivity.
  • F1 Score: With an F1 score of 0.95, the model achieves a harmonious balance between precision and recall, signifying robust overall performance.
  • Accuracy: The model accurately predicts 96% of all cases, showcasing its reliability in making correct predictions.
  • ROC AUC Score: The model achieves a high ROC AUC score of 49.80%, reflecting its ability to distinguish between positive and negative cases effectively.
  • Confusion Matrix Analysis: The model demonstrates excellent performance in true negative predictions (99.86%), low false positive (0.14%) and false negative (4.25%) rates, and a high true positive rate (45.81%).

Conclusion: The XGBClassifier model, tuned for optimal performance, excels across multiple evaluation metrics, including precision, recall, F1 score, accuracy, and ROC AUC score. Its ability to effectively predict both positive and negative cases, coupled with low rates of false positives and false negatives, attests to its reliability and suitability for the HCDR classification task. Overall, the results provide confidence in the model's capability to make accurate predictions and its potential for practical deployment in real-world scenarios.

Recommendations for Improvement:

  1. Exploratory Data Analysis (EDA):

    • Conduct a more in-depth EDA to understand the characteristics of the data, identify outliers, and discover potential relationships.
  2. Feature Engineering:

    • Explore additional feature engineering techniques to create new informative features or transformations that might better capture patterns in the data.
  3. Hyperparameter Tuning:

    • Optimize hyperparameters for the XGBoost model. Fine-tuning parameters like learning rate, maximum depth, and subsample could lead to improved performance.
  4. Model Comparison:

    • Experiment with other models or ensemble methods to see if a different algorithm might better suit the dataset.
  5. Cross-Validation:

    • Ensure robust cross-validation to obtain a more reliable estimate of model performance.
  6. Kaggle Forum and Discussions:

    • Participate in Kaggle forums and discussions to gain insights from the community on specific challenges or strategies for improving model performance on this dataset.
  7. Ensemble Methods:

    • Explore the possibility of combining predictions from multiple models using ensemble methods to enhance predictive accuracy.

By addressing these recommendations and learning from the successful strategies of other submissions, we aim to narrow the performance gap and potentially surpass the current best scores.

Kaggle Submission¶

image-19.png

References¶

Some of the material in this notebook has been adopted from here

References

  1. https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction/notebook
  2. https://towardsdatascience.com/a-machine-learning-approach-to-credit-risk-assessment-ba8eda1cd11f
  3. https://juhiramzai.medium.com/introduction-to-credit-risk-modeling-e589d6914f57
  4. https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
  5. https://stackoverflow.com/questions/28930465/what-is-the-difference-between-flatten-and-ravel-functions-in-numpy

  6. https://www.analyticsvidhya.com/blog/2020/10/7-feature-engineering-techniques-machine-learning/

  7. https://www.geeksforgeeks.org/append-extend-python/
  8. https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
  9. https://medium.com/mindorks/what-is-feature-engineering-for-machine-learning-d8ba3158d97a

  10. https://medium.com/analytics-vidhya/what-is-multicollinearity-and-how-to-remove-it-413c419de2f

  11. https://github.com/Anitha-Ganapathy/Home-Credit-Default-Risk-AML-Project/blob/main/Group1_Phase3_PyTorch%20Deep%20Learning.ipynb

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools¶

Read the following:

  • feature engineering via Featuretools library:
    • https://github.com/Featuretools/predict-loan-repayment/blob/master/Automated%20Loan%20Repayment.ipynb
  • https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/
  • feature engineering paper: https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf
  • https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/